SYSTEM AND METHOD FOR PROVIDING HIGHLY AVAILABLE DATA STORAGE USING GLOBALLY ADDRESSABLE MEMORY

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of co-pending U.S. patent application Ser. No. 08/754,481 filed Nov. 22, 1996 and co-pending U.S. patent application Ser. No. 08/827,534, filed Mar. 28, 1997.

TECHNICAL FIELD

The present invention relates in general to distributed data storage systems and, more specifically, to systems and methods that maintain a highly available distributed store of data.

BACKGROUND INFORMATION

Computer based structured storage systems, such as computer file systems and database systems, have been remarkably successful at providing users with quick and facile access to enormous amounts of data. The importance of these structured storage systems in today's commerce is difficult to exaggerate. For example, structured storage systems have allowed businesses to generate and maintain enormous stores of persistent data that the company can modify and update over the course of years. For many companies, this persistent data is a valuable capital asset that is employed each day to perform the company's core operations. The data can be, for example, computer files (e.g., source code, wordprocessing documents, etc.), database records and information (e.g., information on employees, customers, and/or products), and/or Web pages. Such data must be "highly available," i.e., the data must be available despite system hardware or software failures, because it is often used for day-to-day decision making processes.

Previous efforts to provide high availability or fault tolerance have included both hardware techniques, such as providing redundant systems, and software approaches, such as redundant array of independent disks (RAID) technology and clustering. Each one of these efforts has its own unique drawbacks.

Redundant systems are typified by double or triple redundancy. These types of systems provide more than one complete machine to accomplish the task of one machine. Each machine performs the same operations in parallel. If one machine fails or encounters an error, the additional machines provide the correct result. Such systems, while highly tolerant of system faults, are extremely expensive. In effect, multiple networks of machines must be provided to implement each network.

A similar fault-tolerant approach for storage is RAID. RAID technology may be implemented as disk mirroring (so-called RAID I) or disk striping with parity (so-called RAID V). Disk mirroring provides highly fault tolerant storage but is expensive, since multiple disks, usually two, must be provided to store the data of one disk. Disk striping with parity has poor performance for intensive write applications, since each time data is written to the array of disks a parity block must be calculated. Disk striping provides rigid N+1 redundancy and suffers additional performance degradation after the first error since the missing block (or blocks) must be recalculated each time a read operation is performed. Finally, such rigid N+1 redundancy schemes have no way of "healing" themselves, that is, after one error the system is no longer N+1 redundant.

Other software approaches to improve the reliability and operation of centralized structured storage network systems have generally involved: (1) static mapping of the data to one or more servers and associated disks (sometimes referred to as "shared nothing" clustering); (2) storing the data in a shared data repository, such as a shared disk (sometimes referred to as "shared everything" clustering); and (3) database replication.

Systems using the first method distribute portions of the data store across a plurality of servers and associated disks. Each of the servers maintains a portion of the structured store of data, as well as optionally maintaining an associated portion of a directory structure that describes the portions of the data stored within that particular server. These systems guard against a loss of data by distributing the storage of data statically across a plurality of servers such that the failure of any one server will result in a loss of only a portion of the overall data. However, although known clustered database technology can provide more fault tolerant operation in that it guards against data loss and provides support for dual-path disks, the known systems still rely on static allocation of the data across various servers. Since data is not dynamically allocated between servers: (1) system resources are not allocated based on system usage, which results in underutilization of those resources; (2) scalable performance is limited because new servers must be provided whenever the dataset grows or whenever one particular server cannot service requests made to its portion of the dataset; and (3) such static allocation still requires at least one of the servers storing the information to survive in order to preserve the data. Also, failure of one server requires a second server to serve the data previously served by the down server, which degrades system performance.

Systems using the second method store the data in a shared data repository, such as a shared disk. The shared disks may be shared between a subset of system nodes or between all nodes of the system. Each node in the system continually updates the central data repository with its portion of the structured store. For example, in a database system, each node exports tables it is currently using to the data store. While this method exports the problems of load balancing to the central data repository, it suffers from two main drawbacks. First, throughput is lowered because of increased overhead associated with ensuring coherency of the centralized data store. Second, locking is inefficient because entire pages are locked when a node accesses any portion of a page. As a result, nodes may experience contention for memory even when no true conflict exists.

Similar to disk mirroring, but at a higher level, are techniques based on database replication. These systems may provide replication of the data stores or of the transactions performed on the data stores. Accordingly, these systems go further in guarding against the loss of data by providing static redundancy within the structured storage system. However, such systems suffer from the same drawbacks as other static techniques described above. Additionally, so-called "transaction-safe" replication techniques suffer from scalability problems as the number of tables served increases.

SUMMARY OF THE INVENTION

The present invention relates to data storage systems that are more reliable and provide greater fault tolerant operation



05/07/2003, EAST Version: 1.03.0002 



5,909,540 


than present data storage systems, and that suffer no performance degradation when an error is encountered. The novel systems described herein achieve self-healing N+1 redundancy for disk storage, RAM storage, and structured data storage by distributing system data and data structures throughout a globally addressable memory space, a portion of which is hosted by one or more different nodes on a network. Because each node locally hosts system pages it is currently accessing, the system has the ability to dynamically move data in response to network activity levels and access patterns in order to optimize performance and minimize node access times. The system further provides distributed control for a plurality of different types of structured storage systems, such as file systems, database systems, and systems that store, share, and deliver Web pages to requesting nodes.

The system is further capable of repairing errors encountered during operation because system data is distributed across network nodes. Appropriate data structures and operating policies are provided that allow the system to identify when a node has damaged or missing information. The information can be located, or regenerated, and is redistributed to other nodes on the network to return the system to N+1 redundancy. Optionally, a shared memory system can be employed, such as a distributed shared memory system (DSM) that distributes the storage of data across some or all of the memory devices connected to a network. Memory devices that may be connected to the network include hard disk drives, tape drives, floppy disk drives, CD-ROM drives, optical disk drives, random access memory chips, or read-only memory chips.

In one aspect, the invention relates to a method for continuing operation after a node failure in a system for providing distributed control over data. A number of nodes are inter-connected by a network and the nodes periodically exchange connectivity information. Stored on each node is an instance of a data control program for manipulating data. Accordingly, multiple, distributed instances of the data control program exist throughout the network. Each instance of the data control program interfaces to a distributed shared memory system that provides distributed storage across the inter-connected nodes and that provides addressable persistent storage of data. Each instance of the data control program is operated to employ the shared memory system as a memory device having data contained therein. The shared memory system coordinates access to the data to provide distributed control over the data. Exchanged connectivity information is used to determine the failure of a node. Once a node failure is recognized, the portion of the data for which the failed node was responsible is determined.

In another aspect, the invention relates to a method for continuing operation after a node failure in a system for providing distributed control over data. A number of nodes are inter-connected by a network and the nodes periodically exchange connectivity information. Stored on each node is an instance of a data control program for manipulating data. Accordingly, multiple, distributed instances of the data control program exist throughout the network. Each instance of the data control program interfaces to a globally addressable data store that provides distributed storage across the inter-connected nodes and that provides addressable persistent storage of data. Each instance of the data control program is operated to employ the globally addressable data store as a memory device having data contained therein. The globally addressable data store coordinates access to the data to provide distributed control over the data. Exchanged connectivity information is used to determine the failure of a node. Once a node failure is recognized, the portion of the data for which the failed node was responsible is determined.

The foregoing and other objects, aspects, features, and advantages of the invention will become more apparent from the following description and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention.

FIG. 1 is a conceptual block diagram of a distributed addressable shared memory structured data storage system according to the invention.

FIG. 2 is a diagrammatic view of an embodiment for logically organizing network nodes.

FIG. 3 is a diagram of one possible embodiment of the system of FIG. 1, namely a distributed addressable shared memory file system providing storage for computer files such as source code files, wordprocessing documents files, etc.

FIG. 4 is a graphical representation of the organization of directory entries and associated file descriptors (also known as "Inodes"), suitable for use with the file system of FIG. 3.

FIG. 5 is a diagram of an Inode suitable for use with the file system of FIG. 3.

FIG. 6 is a flowchart of the steps to be taken to update file system metadata.

FIG. 7 illustrates a distributed shared memory computer network.

FIG. 8 is a functional block diagram that illustrates in more detail one distributed shared memory computer network of the type shown in FIG. 7.

FIG. 9 illustrates in more detail a shared memory subsystem suitable for practice with the network illustrated in FIG. 8.

FIG. 10 is a functional block diagram of one shared memory subsystem according to the invention.

FIG. 11 illustrates a directory page that can be provided by a shared memory subsystem of the type depicted in FIG. 10.

FIG. 12 illustrates a directory that can be distributed within a shared memory and formed of directory pages of the type illustrated in FIG. 11.

FIG. 13 illustrates in functional block diagram form a system that employs a directory according to FIG. 12 for tracking portions of a distributed shared memory.

DESCRIPTION

A network system 10 according to the invention includes a plurality of network nodes that access a memory space storing a structured store of data, such as a structured file system or a database. Each of the nodes includes at least a data control program which accesses and manages the structured store of data. The structured store of data may be stored in an addressable shared memory or the structured store may be stored in a more traditional fashion. For example, each node may be responsible for storing a particular element or elements of the structured store of data. In such an embodiment, the data control program can access a desired portion of the structured store using a globally unique identifier. The underlying system would translate the identifier into one or more commands for accessing the desired data, including network transfer commands. In another embodiment, the structured store of data is stored in


an addressable shared memory space, which allows the nodes to transparently access portions of the structured store using standard memory access commands.

The system 10 can be a file system, a database system, a Web server, an object repository system, or any other structured storage system that maintains an organized set of data. As used herein, the term "Web server" means any processor that transmits data objects (such as Active X objects), applications (such as JAVA applets), or files (such as HTML files), to a requester via Web protocols (e.g., http or ftp). In one disclosed embodiment, the system 10 is a file system that maintains various computer files. However, this is just one embodiment of the invention that is provided for illustrative purposes. The invention can be employed to provide any one of a plurality of structured storage systems (e.g., database system, Web page system, Intranet, etc.). The invention is not to be limited to the file system or other particular embodiments described herein.

Referring to FIG. 1, a network system 10 according to the invention includes a plurality of network nodes 12a-12d and an addressable shared memory space 20 that has a portion 22 for storing a structured store of data 28. Each of the nodes 12a-12d can include several sub-elements. For example, node 12a includes a processor 30a, a data control program 32a, and a shared memory subsystem 34a. In the disclosed embodiment, two of the nodes, 12a and 12c, include monitors that provide displays 40 and 42 graphically depicting the structured store of data 28 within the addressable shared memory space 20. The addressable shared memory space 20 interconnects each of the network nodes 12a-12d and provides each node 12a-12d with access to the structured store of data 28 contained within the addressable shared memory space 20.

A system 10 according to the invention can provide, among other things, each network node 12a-12d with shared control over the structured store of data 28 and, therefore, the system 10 can distribute control of the data store across the nodes of the network. To this end, each node of the system 10, such as node 12a, includes a data control program 32a that interfaces to a shared memory subsystem 34a. The data control program 32a can operate as a structured storage system, such as a file system, that is adapted to maintain a structured store of data and to employ the shared memory system as an addressable memory device that can store a structured store of data. At the direction of the data control program 32a, the shared memory subsystem 34a can access and store data within the addressable shared memory space 20. These cooperating elements provide a structured storage system that has a distributed architecture and thereby achieves greater fault tolerance, reliability, and flexibility than known structured storage systems that rely on centralized control and centralized servers. Accordingly, the invention can provide computer networks with distributively controlled and readily scaled file systems, database systems, Web page systems, object repositories, data caching systems, or any other structured storage system.

It is necessary to maintain a list of all nodes present in the network (12a-d in FIG. 1) and interconnection information for the nodes in order to provide various system functionalities. In one embodiment, the node information is used to provide a level of failure recovery, which will be described in more detail below.

Any data structure for maintaining a list of nodes may be used so long as the list remains relatively compact. It is also desirable that the list of network nodes is independent of network technology, which means that network addresses should not be directly used. In one embodiment, a unique identification code is assigned to each node 12a-d in the network. The identification code assigned to each node should not change.

A network, as described throughout the specification, may include many thousands of nodes that are geographically dispersed or located on distinct networks. Maintaining a flat list of nodes for such a network topology results in an extreme amount of list maintenance overhead. Therefore, it is generally desirable to add some structure to the node list in order to reduce maintenance overhead.

Referring to the embodiment depicted in FIG. 2, nodes 12a-e are collected into groups of nodes 52, 54 that may be defined to reflect various network topologies. Groups of nodes may also be grouped. This leads to a tree-structured hierarchy of nodes and groups. There is one "root" group 56 that includes as members every group and node present in the network. Further efficiencies may be achieved by limiting group size to a predetermined number of nodes.

In the embodiment described by FIG. 2, group membership is expected to change infrequently, if at all. In general, when a node 12a-e is introduced into the network, it is configured into a particular group 52, 54, and the node's group affiliation should change only as a result of a command issued by the network administrator.

For example, for embodiments in which nodes are grouped and the number of nodes belonging to any one group is bounded, two forms of identification may be assigned to each node. A short form of identification may be assigned that encodes grouping information and therefore may change, however infrequently, with network topology or logical organization. Also, a longer form of identification may be assigned to each node that is guaranteed to remain unchanged. The latter form is primarily used to refer to each node 12a-e in the network.

For example, a group of nodes 52, 54 may be limited to 64 members, requiring 6 bits to encode the identification information for each node. Accordingly, 12 bits would allow the system to uniquely identify any node in the network when more than one group of nodes exists, up to a maximum of 64 groups. For networks in which more than 64 groups of nodes exist, groups of nodes must be themselves grouped and 18 bits would be required to uniquely identify any node in the system.

Each node 12a-e may also be assigned a permanent identification code that is invariant for the life of the node. Permanent identification codes may be constructed using a global address component to make it unique in space and a date or time component to make it unique in time. Thus, in this embodiment, a node's permanent identification code will consist of both an address and a time-stamp to ensure that it is unique.

Referring once again to FIG. 1, the system 10 of the invention maintains within the addressable shared memory space 20 a structured store of data 28. Each of the nodes 12a-12d can access the addressable shared memory space 20 through the shared memory subsystems 34a-34d. Each of the shared memory subsystems 34a-34d provides its node with access to the addressable shared memory space 20. The shared memory subsystems 34a-34d coordinate each of the respective node's memory access operations to provide access to the desired data and maintain data coherency within the addressable shared memory space 20. This allows the interconnected nodes 12a-12d to employ the addressable shared memory space 20 as a space for storing and retrieving data. At least a portion of the addressable shared memory
space 20 is supported by a physical memory system that provides persistent storage of data. For example, a portion of the addressable shared memory space 20 can be assigned or mapped to one or more hard disk drives that are on the network or associated with one or more of the network nodes 12a-12d as local hard disk storage for those particular nodes. Accordingly, FIG. 1 illustrates that systems of the invention have shared memory subsystems providing the network nodes with access to an addressable shared memory space, wherein at least a portion of that space is assigned to at least a portion of one or more of the persistent storage memory devices (e.g., hard disks) to allow the nodes addressably to store and retrieve data to and from the one or more persistent storage memory devices. A preferred embodiment of such an addressable shared memory space is described in the commonly-owned U.S. patent application Ser. No. 08/754,481 filed Nov. 22, 1996, and incorporated by reference above.

Therefore, one realization of the present invention is that each of the nodes 12a-12d can employ its respective shared memory subsystem as a memory device that provides persistent data storage.

Each of the data control programs 32a-32d is a software module that couples to the respective shared memory subsystem 34a-34d in a way that operates similarly to an interface between a conventional data storage program and a local memory device. For example, the data control program 32a can stream data to, and collect data from, the shared memory subsystem 34a. Because the shared memory subsystems coordinate the memory accesses to the addressable shared memory space 20, each of the data control programs is relieved from having to manage and coordinate its activities with the other data control programs on the network or from having to manage and coordinate its activities with one or more central servers. Accordingly, each of the data control programs 32a-32d can be a peer incarnation (i.e., an instance) residing on a different one of the network nodes 12a-12d and can treat the respective shared memory subsystem 34a-34d as a local memory device such as a local hard disk.

One or more of the data control programs 32a-32d can provide a graphical user interface 42 that graphically depicts the structured store of data 28 contained within the addressable shared memory space 20. The graphical user interface 42 allows a user at a node, for example at node 12a, to insert data objects graphically within the structured store of data 28. To this end, the data control program 32a can generate a set of commands that will present a stream of data to the shared memory subsystem 34a and the shared memory subsystem 34a will employ the data stream to store an object within the structured store of data 28. Similarly, the other shared memory subsystems 34b-34d can provide information to their respective nodes that is indicative of this change to the structured store of data 28. Accordingly, as depicted in FIG. 1 for node 12c only for simplicity, that node (which includes a graphical user interface 40) reflects the change to the structured store of data 28 effected by the data control program 32a of the node 12a. In particular, the graphical user interface 40 of the node 12c can depict to a user that an object is being placed within the structured store of data 28. For example, the addressable shared memory space 20 also contains the data objects 50a-50c which can be placed within the structured data store 28 to become part of that structured data store. As illustrated, a system user at node 12a can direct object 50a to be inserted at a set location within the data store 28. The data control program 32a then directs the shared memory subsystem 34a to place the object 50a within the data store 28 at the proper location. Moreover, the shared memory subsystem 34c on node 12c detects the change within the data store 28 and reflects that change within the graphical user interface 40.

Referring now to FIG. 3, a structured file system 60 is a particular embodiment according to the invention that employs the properties of the addressable shared memory space 20 to implement what looks to all network nodes like a coherent, single file system when in fact it spans all network nodes coupled to the addressable shared memory space 20.

The file system 60 of FIG. 3 differs from known physical and distributed file systems in a variety of ways. In contrast to known physical file systems that map a file organization onto disk blocks, the file system 60 according to the invention manages the mapping of a directory and file structure onto a distributed addressable shared memory system 20 which has at least a portion of its addressable space mapped or assigned to at least a portion of one or more persistent storage devices (e.g., hard disks) on the network. Unlike known distributed file systems, the file system 60 of the invention employs peer nodes, each of which has an incarnation or instance of the same data control program. Also, unlike known file systems generally, the file system 60 of the invention: maintains data coherence among network nodes; automatically replicates data for redundancy and fault tolerance; automatically and dynamically migrates data to account for varying network usage and traffic patterns; and provides a variety of other advantages and advances, some of which are disclosed in the commonly-owned U.S. patent application Ser. No. 08/754,481 filed Nov. 22, 1996, and incorporated by reference above.

Still referring to FIG. 3, the file system 60 resides in part within the addressable shared memory space 20 and includes a structured store of data 62, a super root 64, filesets 66-74, directory entry 80, and file or document 82. Two network nodes 84 and 86 are shown accessing the addressable shared memory space 20 (in the manner described previously with reference to FIG. 1) via the logical drives 90 and 94. Application programs 92 and 96 executing on the nodes interact with the data control programs (not shown in FIG. 3 but shown in FIG. 1 as 32a-32d) and cause the data control programs in the nodes to access the logical drives 90 and 94. In the disclosed embodiment, the logical drives are DOS devices that "connect to" the fileset directories via Installable File System drivers associated with the file system 60.

The file system 60 supports one global file system per addressable shared memory space 20 shared by all of the network nodes. This global file system is organized into one or more independent collections of files, depicted as the filesets 66-74. A fileset can be considered logically equivalent to a traditional file system partition. It is a collection of files organized hierarchically as a directory tree structure rooted in a root directory. The non-leaf nodes in the tree are the directories 80, and the leaves in the tree are regular files 82 or empty directories. Sub-directory trees within a fileset can overlap by linking a file to multiple directories.

A benefit of breaking up the file system 60 into filesets 66-74 is that it provides more flexible file system management for users of the system 60. As the file system 60 grows into very large sizes (e.g., hundreds of nodes with thousands of gigabits of storage), it is desirable to have the files organized into groups of management entities such that management actions can be independently applied to individual groups without affecting the operation of the others.
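The fileset organization described above can be illustrated with a minimal sketch. It assumes an in-memory model only: the names GlobalFileSystem, Fileset, Directory, File, and the online flag are hypothetical and do not come from the specification, and no shared memory, replication, or coherency is modeled. The sketch shows one global file system holding several independent directory trees, one file linked into two directories of the same fileset, and a management action (taking a fileset off-line) that leaves the other filesets usable.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class File:
    """A regular file: a leaf of a fileset's directory tree."""
    data: bytes = b""


@dataclass
class Directory:
    """A directory: maps entry names to sub-directories or files."""
    entries: Dict[str, Union["Directory", File]] = field(default_factory=dict)


@dataclass
class Fileset:
    """An independent collection of files rooted in one root directory."""
    root: Directory = field(default_factory=Directory)
    online: bool = True  # hypothetical per-fileset management flag


class GlobalFileSystem:
    """One global file system, organized into named filesets (hypothetical)."""

    def __init__(self) -> None:
        self.filesets: Dict[str, Fileset] = {}

    def create_fileset(self, name: str) -> Fileset:
        self.filesets[name] = Fileset()
        return self.filesets[name]

    def lookup(self, fileset: str, path: str) -> Optional[Union[Directory, File]]:
        """Walk one fileset's tree; return None if absent or off-line."""
        fs = self.filesets[fileset]
        if not fs.online:
            return None
        node: Union[Directory, File] = fs.root
        for part in filter(None, path.split("/")):
            if not isinstance(node, Directory) or part not in node.entries:
                return None
            node = node.entries[part]
        return node


gfs = GlobalFileSystem()
source = gfs.create_fileset("source")
documents = gfs.create_fileset("documents")

# Sub-directory trees overlap by linking the same file into two directories.
shared = File(b"int main(void) { return 0; }")
source.root.entries["app"] = Directory({"main.c": shared})
source.root.entries["backup"] = Directory({"main.c": shared})
documents.root.entries["readme.txt"] = File(b"hello")

# A management action on one fileset does not affect the other.
source.online = False
```

After the last statement, lookups in the "source" fileset return None while "documents" continues to resolve paths, which is the kind of independent management the paragraph above attributes to filesets.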
The filesets in the addressable shared memory space 20 are described and enumerated in a common structure, the root 64 of which provides the starting point to locate the filesets in the addressable shared memory space 20. The root 64 can be stored in a static and well-known memory location in the addressable shared memory space 20, and it can be accessed via a distributed shared memory system program interface. When a node is accessing a fileset for the first time, it first looks up the root 64 to determine the identifier associated with the fileset, e.g., the shared memory address used to access the fileset. Once it has determined the identifier, the node can access the root directory of the fileset. From the root directory, it then can traverse the entire fileset directory tree to locate the desired file. Filesets used by the file system 60 are described in greater detail below under the heading "Fileset."

Referring to FIG. 4, in the disclosed embodiment of the file system 60 according to the invention, a directory 126 (such as the directory 80 of FIG. 3) is accessed by starting at a directory Inode or descriptor 128 containing an address that points to a directory entries stream descriptor 130. This descriptor 130 is a pointer to a block of data containing directory entries for files File 1 through File 3. The directory entry for File 1 has a number of entries; one of the entries is a string containing the name of the file and another entry is the address of the Inodes and stream descriptors 132. The stream descriptors for File 1 are used to locate and retrieve the various 4 kilobyte pages in the addressable shared memory space 20 that constitute File 1. Other files are retrieved and constructed from the addressable shared memory space 20 in the same fashion. The directories used by the file system 60 are described in greater detail below under the heading "Directory."

In the embodiment of the file system 60 disclosed in FIG. 5, a file 98 (such as the file 82 of FIG. 3) is represented by one or more shared pages of data 100, 102, 104, 106, and 108 in the addressable shared memory space 20. Each file 98

data stream in the fileset. In order to ready a fileset for deletion, the fileset must be "shutdown" by putting it off-line.

Fileset Enumeration

This operation enumerates a specific fileset, or all the filesets, in the addressable shared memory space 20.

Fileset Control

This operation performs fileset level control routines such as setting fileset attributes.

Mount Export Control

Directories are attached to local devices, i.e., "mounted," using parameters stored in the Windows NT registry, or some other similar central storage area for such information. When first started up, the data control program 60 accesses the central storage and determines which filesets should be mounted. The data control program creates a file object representing each fileset identified by the entries in the central storage. In some embodiments an API may be provided which allows the data control program 60 to dynamically mount and unmount filesets by making appropriate API calls.

The users of the file system 60 are not aware of the shared memory "logical volume," but rather view each fileset as a volume (or partition in the sense of a traditional physical file system). The Win32 GetVolumeInformation is used to get information on the fileset (more precisely, on the logical device to which the fileset is attached). Because all the filesets share the same pool of storage in the addressable shared memory space 20, the total volume size returned to the user for each fileset is the current aggregate storage capacity in the addressable shared memory space 20. The same approach is taken for the total free space information, and the aggregate value of the addressable shared memory space 20 is returned for each fileset.

DIRECTORY

Directory entry scanning is one of the most frequently

i_ r. i j j i ■ * ha *u ♦ • i fii~ performed operations by user applications. It is also may be 

has a file Inode or descriptor 110 that includes various file f. , ** ., . ■ * r c 

... ... „ C1 / . . .. A . AA the most visible operation in terms of performance. 

attributes 112. The file descriptor 110 contains an address _ n , * 4 . . . , * , . 

iLi - A * .. , j*u^.40 Consequently, much attention is directed to making the 

that points to a data stream descriptor 114, and the data w ^ a- ■ . A *u ym a wt n„ c.t,*™ 

^ • . i j a a 11 e i->n directory scan efficient and the WindowsNT Files System 

stream itself includes one or more addresses 116, 118, 120, n , nT£? v ; , r . „- . 4 fll T , . ^ „ J > 

+ ™ j-i^.t. . • i . ,i .j *. ■ jc li (NTFS) duplicates sufficient file Inode information in the 

122, and 124 that pomt to particular pages in the identifiable \. ' * , . , , # . , 

* j <m t *u a- i „ a *, m u~A; ma „t o directory entry such that a read directory operation can be 

shared memory space 20. In the disclosed embodiment, a . _ j ' 3 , . t . ... 

. , J F . AA U1 . , 7 satisfied by scanning and reading the directory entries with- 

page is the atomic unit m the addressable shared memory . 3 & , , . _ & ,. P i fi1 , , 

™ j • j 4 • 4 a 1 m . c a * c ' f 4< out gome out to read the information from the file Inodes. 

space 20, and it contains up to 4 kilobytes of data. Even it ^ _ & * . . , . t . 4 , , , . , C1 

,l ^ \ 1 L , • , j J . , ™ . The problem with this scheme is that the doubly stored file 

the entire 4 kbytes is not needed, an entire page is used. This 4 r A A , . £1 . , j ct • u 

. ... 4 t , ^ , ino , ( i , ■ u * n metadata, such as the file time stamps and file size, can be 

is illustrated by the page 108 that only contains about 2 , A A ' . _ . . . tU * t A . A \ 

1W . ,/„ F ,. 6 j u *i, ci * *n updated quite frequently, making the metadata update more 

kbytes of data. The files used by the file system 60 are y • rr *u- u a • a a »li 

j , . , j.m.i j .t i ^c-, „ expensive. However, this overhead is considered acceptable 

described in greater detail below under the beading Files. . * ■ j • j- « 

& 50 in face of the performance gained m directory scan opera- 

FILESET tions. 

The filesets are the basic unit for the file system 60. Each The file system 60 adopts the same philosophy of pro- 
fileset is identified with a name having up to 255 characters. viding efficient directory scanning by duplicating file Inode 
The file system 60 exports a set of fileset level operations information in directory entries. Each directory entry con- 
that allow an administrator to manage the filesets through 55 tains sufficient information to satisfy the Win32 query file 
the following type of actions. information requests. The file Inode is stored with the file 
Fileset Creation stream descriptors on a separate page. The Inode is located 

This operation creates a new fileset. The fileset is initially via a pointer in the directory entry, 

created with one file, the empty root directory. A default The file system's directory entries are stored in the 

fileset is created automatically at the initialization of the 60 directory file's directory entry data stream. To maximize 

addressable shared memory space 20. space utilization, each directory entry is allocated on the first 

Fileset Deletion available free space in a page that can hold the entire entry. 

This operation deletes a fileset. All files in the fileset are The length of the entry varies depending on the length of the 

removed, and all shared memory space allocated to the files file's primary name. The following information is part of the 

in the fileset is discarded and the backing physical storage 65 directory entry: creation time; change time; last write time; 

freed for new storage. The file system 60 will only allow last accessed time; pointers to stream descriptor; pointer to 

deletion of a fileset until there are no open handles to file parent directory Inode; MS-DOS type file attributes; and 
MS-DOS style file name (8.3 naming convention). For average file
name lengths, a page contains up to about 30 entries. All the file
information in the directory entry is also contained in the file
Inode, except for the file primary name and MS-DOS file name. The
file primary names and associated short names are only stored in
the directory entries. This makes the Inode size fixed.

When file information is modified (except for file names), the
Inode is updated in the context of the update transaction and
therefore always contains the most up-to-date information. The
associated directory entry change is lazily flushed to reduce the
cost of double updating. This means the Inode updates are either
flushed or recoverable, but not the corresponding directory entry
updates. If the directory entry gets out of synch with the Inode
(when the Inode change is successfully flushed but not the
directory change), the entry is updated the next time the Inode is
updated. In order to facilitate synchronization of directory
updates, the directory entries (Inodes) can not span multiple
pages. FIG. 4 illustrates the organization of directory entries and
associated Inodes.

FILES

A file of the file system 60 comprises streams of data and the file
system metadata to describe the file. Files are described in the
file system 60 by objects called Inodes. The Inode is a data
structure that stores the file metadata. It represents the file in
the file system 60.

A data stream is a logically contiguous stream of bytes. It can be
the data stored by applications or the internal information stored
by the file system 60. The data streams are mapped onto pages
allocated from the addressable shared memory space 20 for storage.
The file system 60 segments a data stream into a sequence of 4
kilobyte segments, each segment corresponding to a page. The file
system 60 maintains two pieces of size information per data stream:
the number of bytes in the data stream; and the allocation size in
number of pages. The byte-stream to segment/page mapping
information is part of the file metadata and is stored in a
structure called data stream descriptor. See FIG. 5.

Users' requests for data are specified in terms of range of bytes
and the position of the starting byte measured by its offset from
the beginning of the data stream, byte position zero. The file
system 60 maps the offset into the page containing the starting
byte and the intra-page offset from the beginning of the page.

Every file of the file system 60 has at least two data streams: the
default data stream; and the Access Control List (ACL) stream. Each
file may optionally have other data streams. The ACL stream is used
to store the security Access Control Lists set on the file. Each
data stream is individually named so that the user can create or
open access to a specific data stream. The name of the default data
stream is assumed to be the primary name of the file. To access a
data stream, the user of the file system 60 must first open a file
handle to the desired data stream by name. When the primary name of
the file is used, the handle to the default data stream is opened.
This open file handle represents the data stream in all the file
system services that operate on the data stream.

The file system 60 exports a set of services to operate at the file
level. The inputs to the services are the file object handle
(Inode) or the data stream object handle, and the operation
specific parameters, including the desired portions of the data
stream in byte positions.

Open files are represented by data stream objects (or just file
objects). Users access files using these file objects, identified
to the users through file handles. A file handle is a 32-bit entity
representing an instance of an open file stream. For example,
WindowsNT creates the file object and returns a file handle to the
users in response to the user request for file creation or file
open. The file system 60 initializes a pointer to a file control
block. Multiple file objects point to the same file control block,
and each file control block maintains separate stream objects for
each context. Externally, the file handle is [illegible] to the
[illegible] can be issued [illegible] file. When [illegible] and
the associated [illegible] handle is removed.

The file system 60 maps file streams into sequences of segments
which become progressively larger; each segment corresponds to one
or more pages. The file system 60 attempts to reserve contiguous
pages for data streams but only allocates real backing storage on
an as needed basis, usually as a result of a file extension
requested by writing beyond the data stream allocation size. When a
file extension request is received, the file system 60 rounds the
extension size in number of bytes up to a multiple of 4 kilobytes
to make it an integer number of pages, and requests pages for
actual allocation. The number of 4 kilobyte pages allocated by the
file system depends on the number of file extension requests made.
The file system 60 allocates one 4 kilobyte page for the first
extension request, two 4 kilobyte pages for the second request,
four 4 kilobyte pages for the third extension request, and so on.
The newly allocated pages are zero filled. By reserving contiguous
pages, the file system 60 can reduce the amount of bookkeeping
information on the byte offset to page mapping. The file system 60
reserves (sometimes much) larger than requested memory space for a
file, and substantiates the storage by allocating backing storage
page by page.

Four kilobyte allocation segments are chosen to reduce the unused
storage space and yet provide a reasonable allocation size for
usual file extensions. Since allocation is an expensive operation
(most likely involving distributed operations), smaller allocation
size is not efficient. Larger allocation size would lead to
inefficient space utilization, or additional complexity to manage
unused space. A 4 kilobyte segment also maps naturally to a page,
simplifying the data stream segment to page mapping. Although an
analogy could be made with the NTFS's allocation policy of 4
kilobyte clusters (segment) size for large disks to speed up
allocation and reduce fragmentation, such analogy is not completely
valid because the actual on-disk allocation segment size depends
greatly on the local disk size and the physical file systems.

Similar to the NTFS, which controls the allocation of each disk
partition and therefore can quickly determine the free volume space
available for allocation, the file system 60 requests the total
available space information and uses this information to quickly
determine whether to proceed with the allocation processing. If the
total available space is less than the required allocation size,
the request is denied immediately. Otherwise, the file system 60
will proceed to allocate the pages to satisfy the request. The fact
that the file system 60 can proceed with the allocation does not
guarantee that the allocation will succeed, because the actual
total available space may change constantly.

The file system 60 takes advantage of the page level replication
capability of the underlying distributed addressable shared memory
system 20 disclosed in the U.S. patent application incorporated by
reference above. Page level replication allows the system to
provide file replication. The data streams of a replicated file are
backed by pages, which are themselves replicated. In this way, data
streams are replicated automatically without intervention of the
file system 60. The extra space consumed by the multiple replicas
is not reflected in the file (data stream) sizes. The stream
allocation size still reports the total allocation size in pages
required for one replica. The pages backing temporary files,
however, are not replicated.
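
The byte-offset mapping and the extension sizing described under the heading "Files" can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and treating the rounding rule and the one, two, four page progression as two independent helpers is an assumption, since the text does not say how the two rules combine.

```python
PAGE_SIZE = 4 * 1024  # a page holds up to 4 kilobytes


def locate(offset):
    """Map a data stream byte offset to (page index, intra-page offset)."""
    return divmod(offset, PAGE_SIZE)


def round_up_to_pages(nbytes):
    """Round an extension size in bytes up to a whole number of pages."""
    return -(-nbytes // PAGE_SIZE)  # ceiling division


def pages_for_extension(request_number):
    """Pages allocated for the 1st, 2nd, 3rd, ... extension request: 1, 2, 4, ..."""
    if request_number < 1:
        raise ValueError("extension requests are numbered from 1")
    return 1 << (request_number - 1)
```

For example, byte offset 10000 falls in the third page (index 2) at intra-page offset 1808, and an extension of 4097 bytes rounds up to two pages.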

FILE ACCESS AND RESOURCE SHARING — 
LOCKING 

The shared memory provides the distribution mechanism
for resource sharing among peer nodes running the file
system 60 software. Each instance of the file system 60 on
each network node views the shared memory resources (i.e.,
pages) as being shared with other local or remote threads.
The file system 60 needs a way to implement high level, file
system locks to provide consistent resource sharing. Any
concurrency control structure can be used to implement
locks, such as lock objects or semaphores. In database
applications, locking may also be achieved by implementing
concurrency control structures associated with database
indices or keys. In file system applications access to files or
directories may be controlled. Another example of file
system locks is Byte Range Locking, which provides the
users the ability to coordinate shared access to files. A byte
range lock is a lock set on a range of bytes of a file.
Coordinated shared access to a file can be accomplished by
taking locks on the desired byte ranges. In general, the high
level file system lock works in the following fashion: (a) a
file system resource is to be shared by each file system 60
instance, and the access to the resource is coordinated by a
locking protocol using a lock object data structure that
represents the high level lock to coordinate the shared
resource, and it is the value of the data structure that
represents the current state of the lock; (b) to access the
resource, the instance at each node must be able to look at
the state (or value) of the lock data structure, and if it is
"free," modify it so that it becomes "busy," but if it is
"busy," then it has to wait for it to become "free," and there
could be intermediate states between "free" and "busy" (i.e.,
more than two lock states), but in any event, in this byte range
locking example, a lock is a description of a certain byte
range being shared/exclusively locked by some thread of the
file system 60, and a conflicting new byte range lock request
that falls in or overlaps the already locked byte range will be
denied or the requester may block (depending on how the
request was made); and (c) access to or modification of the
lock data structure by each node's instance needs to be
serialized so that it in turn can then be used to coordinate
high level resource sharing.

The locking features and capabilities of the shared 
memory engine described in the U.S. patent application Ser. 
No. 08/754,481, incorporated by reference above, allow the 
file system 60 to coordinate access to pages. The engine can 
also be used to coordinate access to resources, but in the case 
of complex high level resource locking such as Byte Range 
Locking, using the engine's locking features and capabilities 
directly to provide locks may be too costly for the following 
reasons: (a) each byte range lock would require a page 
representing the lock, and since the number of byte range 
locks can be large, the cost in terms of page consumption 
may be too high; and (b) the engine locks only provide two 
lock states (i.e., shared and exclusive), and high level file 
system locks may require more lock states. 

The file system 60 of the invention implements the file
system locking using the engine locking as a primitive to
provide serialization to access and update the lock data
structures. To read a lock structure, the file system 60 takes
a shared lock on the data structure's page using the engine
locking features and capabilities before it reads the page to
prevent the data structure being modified. To modify the
lock structure, it sets an exclusive lock on the page. The page
lock is taken and released as soon as the lock structure value
is read or modified.

With the serialization provided by the page locking and 
the page invalidation notification, the file system 60 imple- 
ments the high level locks in the following way: (a) to take 
a file system lock (FS lock), the file system 60 sets a shared 
lock on the FS lock page and reads the page and then 
examines the lock structure; (b) if the lock structure indi- 
cates the resource is unlocked or locked in compatible lock 
mode, then the file system 60 requests to exclusively lock 
the page, and this guarantees only one file system 60 node 
instance can modify the lock data structure, and if the 
request succeeds then the file system 60 write maps the lock 
page and then changes the lock structure to set the lock and 
unlocks the page and sets page access to none; and (c) if the 
resource is locked in incompatible lock mode, the file system 
60 unlocks the page but retains the page read mapped, and 
it then puts itself (the current thread) in a queue and waits for 
a system event notifying that the lock value has changed, 
and when the lock value does change then the file system 60 
thread gets notified and repeats the step (a) above. The file 
system 60 implements the notification using a signal primi- 
tive. The file system 60 threads waiting for a lock are 
blocked on a system event. When the page containing the 
lock changes, a signal is sent to each blocked file system 60 
thread. Each blocked file system 60 thread then wakes up
and repeats step (a). FS locks are stored in volatile pages. 
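
Steps (a) through (c) above can be sketched in a single process as follows. This is an illustration under stated assumptions, not the engine's interface: the names LockPage, fs_lock and fs_unlock are hypothetical, a threading.Lock stands in for both the shared and the exclusive engine page lock, a threading.Condition stands in for the page-change signal, and the re-check in step (b) is added here because the lock structure may change between the read and the exclusive request.

```python
import threading


class LockPage:
    """Stand-in for a volatile page holding one FS lock structure."""

    def __init__(self):
        self.page_lock = threading.Lock()     # stand-in for the engine page lock
        self.changed = threading.Condition()  # stand-in for the page-change signal
        self.mode = None                      # None (free), "shared" or "exclusive"
        self.holders = 0


def compatible(held, requested):
    """A free lock admits any request; a shared lock admits further shared requests."""
    return held is None or (held == "shared" and requested == "shared")


def fs_lock(page, requested, block=True):
    while True:
        # Step (a): take the page lock briefly and examine the lock structure.
        with page.page_lock:
            looks_free = compatible(page.mode, requested)
        if looks_free:
            # Step (b): lock the page again to modify the structure, re-checking
            # because another instance may have changed it in between.
            with page.page_lock:
                if compatible(page.mode, requested):
                    page.mode = requested
                    page.holders += 1
                    return True
            continue
        # Step (c): incompatible mode; fail or wait for a change, then retry (a).
        if not block:
            return False
        with page.changed:
            page.changed.wait(timeout=0.05)


def fs_unlock(page):
    with page.page_lock:
        page.holders -= 1
        if page.holders == 0:
            page.mode = None
    with page.changed:
        page.changed.notify_all()  # wake every blocked waiter, as the signal does
```

The page lock is held only while the structure is read or written, which mirrors the statement that the page lock is released as soon as the lock structure value has been read or modified.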

FILE ACCESS AND RESOURCE SHARING— 
BYTE RANGE LOCKING 






Byte Range Locking is a file system locking service 
exported to the users through the Win32 LockFile( ) and 
LockFileEx( ) API. It allows simultaneous access to different 
non-overlapping regions of a file data stream by multiple 
users. To access the data stream, the user locks the region 
(byte range) of the file to gain exclusive or shared read 
access to the region. 

The file system 60 supports byte range locking for each
individual data stream of the file. The following Win32-style
byte range locking behavior is supported: (a) locking a
region of a file is used to acquire shared or exclusive access
to the specified region of the file, and the file system 60 will
track byte range locks by file handle, therefore file handles
provide a way to identify uniquely the owner of the lock; (b)
locking a region that goes beyond the current end-of-file
position is not an error; (c) locking a portion of a file for
exclusive access denies all other processes both read and
write access to the specified region of the file, and locking
a portion of a file for shared access denies all other processes
write access to the specified region of the file but allows
other processes to read the locked region, and this means
that the file system 60 must check byte range locks set on the
data stream not only for lock requests but for every read or
write access; (d) if an exclusive lock is requested for a region
that is already locked either shared or exclusively by other
threads, the request blocks or fails immediately depending
on the calling option specified; and (e) locks may not
overlap an existing locked region of the file.
For each byte range lock, the file system 60 creates a byte
range lock record to represent the lock. The record contains
the following information: (a) byte range; (b) lock mode
(shared or exclusive); (c) process identification; and (d) a
Win32 lock key value.
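
The lock record and the per-access check described above might be modeled as in the following sketch. The field and function names are hypothetical, the owner is identified here by the process identification field for brevity (the text tracks owners by file handle), and two shared locks over the same range are treated as compatible, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteRangeLock:
    offset: int       # first byte of the locked range
    length: int       # number of bytes in the range
    exclusive: bool   # lock mode: True for exclusive, False for shared
    process_id: int   # process identification of the owner
    key: int          # Win32 lock key value


def overlaps(a_off, a_len, b_off, b_len):
    """True when two half-open byte ranges share at least one byte."""
    return a_off < b_off + b_len and b_off < a_off + a_len


def conflicts(existing, request):
    """Two locks conflict when they overlap and at least one is exclusive."""
    return (overlaps(existing.offset, existing.length,
                     request.offset, request.length)
            and (existing.exclusive or request.exclusive))


def may_access(locks, process_id, offset, length, write):
    """Check a read or write by process_id against locks held by other processes."""
    for lk in locks:
        if lk.process_id == process_id:
            continue  # the text restricts only other processes
        if not overlaps(lk.offset, lk.length, offset, length):
            continue
        if lk.exclusive or write:
            return False  # exclusive denies reads and writes; shared denies writes
    return True
```

Because every read and write must be checked, may_access would run on each data access, not only when a new lock is requested.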

The file system 60 regards the file byte ranges as resources
with controlled access. For each byte range lock record, the
file system 60 creates a file system lock (as discussed above)
to coordinate the access to the byte range "resource." A
compatible byte range lock request (shared lock) translates
into taking a read lock on the file system lock associated with
the byte range record. An exclusive byte range lock request
is mapped to taking a write lock on the file system lock.

Using the file system locking mechanism discussed 
above, lock requests waiting on the page containing the 
desired byte range will be notified when the page content 
changes. 

Having described in some detail a particular embodiment 
of the invention, namely the file system 60, a brief summary 
of the disclosure on that file system 60 is now presented in 
the following three paragraphs. 

The file system 60 views the addressable shared memory
space 20 as a shared flat identifier space being concurrently
accessed by many network nodes (e.g., 12a-12cz). The file
system 60 maps the file data and metadata logically repre-
sented as byte streams in pages, and a page is the unit of the
addressable shared memory space 20. From the viewpoint of
the file system 60, a data stream is stored in an ordered
collection of pages. The file system 60 calls the engine
described in the above-identified, incorporated-by-reference
U.S. patent application to allocate space in pages from the
addressable shared memory space 20 in order to store its
metadata and user file data. Sections of the addressable
shared memory space 20 that are reserved by the file system
60 can be accessed by an instance of the file system 60 in the
addressable shared memory space 20 but not by other types
of network nodes.

Each instance of the file system 60 (including the data
control programs 32a-34rf of FIG. 1) is a peer of all other
network nodes running the file system 60 and thus sharing
file data via the addressable shared memory space 20. To the
local users, the file system 60 exhibits the characteristics of
single node consistency, and file sharing behaves as if
sharing with other processes on the same node. From a file
system 60 user's viewpoint, the following local behavior is
observed: (a) file data update is consistent in the entire
network, i.e., if a file page is changed, the modification is
seen immediately by other users with open handles to the
file, and for shared write file access, the coordination of the
shared write access, if any, rests entirely with the users,
usually by means of byte range locking; and (b) flush is a
global operation in the network, i.e., a valid dirty file page
can be flushed from any node that has the file open, and if
the flush is not successfully completed, the resulting file data
state could be either the old state before the flush or a
partially written state, and the file system 60 instances utilize
the shared address space both as data storage and as a mecha-
nism for information passing (locking and information
sharing).

The following is a list of the types of requests the file
system 60 can issue to the underlying engine disclosed in the
above-identified, incorporated-by-reference U.S. patent
application: (a) reserve a chunk of the addressable shared
memory space 20: the file system 60 requests a contiguous
range of addresses to be reserved, and the reservation of
shared memory space does not cause the materialization
(allocation) of the memory space but instead it merely
reserves the space represented by the addresses; (b) unre-
serve a chunk of addresses in the addressable shared
memory space 20: the file system 60 frees a range of
addresses that is no longer used, and there is no unreserve
call since addresses are not re-used once discarded; (c)
materialize a chunk of the addressable shared memory space
20: allocate shared memory space for previously reserved
addresses, and the allocated space is returned as pages, and
it is only after reserved shared memory space is allocated
that it can be accessed by the file system 60, and allocated
pages are accessible but to access them the file system 60 has
to make formal access requests, and the file system 60 must
specify whether the pages being allocated are persistent or
volatile, and also the file system 60 can optionally specify
the number of replicas required for the pages; (d) deallocate
pages: a number of pages is freed by the file system 60, and
the resources represented by the pages can be recycled; (e)
access range of pages: the file system 60 requests read or
write access to pages, and this will cause the page to be
brought to the local node memory; (f) unreference a range of
pages: the file system 60 indicates that it no longer needs to
access the pages; (g) lock and unlock a page: the file system
60 uses the lock and unlock page to synchronize read-write
updates to a page, and the lock semantic "shared, exclusive"
is used; (h) flush dirty pages: the file system 60 requests a
flush of dirty pages to persistent storage, and the flush is
typically carried out when explicitly requested by the file
system's callers, when a file is closed, or during a file system
shutdown; (i) set attributes on pages and subpages: the file
system 60 uses the page attributes to specify the desired page
management behavior, and some of the attributes set by the
file system include the best coherency protocol for the page,
how the pages should be replicated (number of core copies),
whether the pages should be cached in the RAM cache, and
a 16-byte file system attribute on the page; (j) read (get)
attributes of a page; (k) get the location of the super_root;
(l) get the total storage space (in bytes) and the total free
storage space (in bytes) of the addressable shared memory
space 20; and (m) transaction logging operations such as
transaction begin, end, commit, abort, get log records, etc.
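
The distinction between reserving addresses in request (a) and materializing them in request (c) can be illustrated with a toy in-memory model. Everything here is hypothetical (the class ToyEngine, its method names and its exceptions); it is not the engine's interface and only mimics the stated rules: reservation allocates nothing, only materialized pages can be accessed, and addresses are never handed out twice.

```python
PAGE_SIZE = 4096


class ToyEngine:
    """Toy model of reserve / materialize / access / deallocate."""

    def __init__(self):
        self._next = 0          # next unused address; only ever grows
        self._reserved = set()  # page addresses reserved but not yet backed
        self._pages = {}        # page address -> {"persistent", "replicas", "data"}

    def reserve(self, npages):
        """Reserve a contiguous address range without allocating storage."""
        start = self._next
        self._next += npages * PAGE_SIZE
        self._reserved.update(range(start, self._next, PAGE_SIZE))
        return start

    def materialize(self, start, npages, persistent=True, replicas=1):
        """Back previously reserved addresses with zero-filled pages."""
        for addr in range(start, start + npages * PAGE_SIZE, PAGE_SIZE):
            if addr not in self._reserved:
                raise ValueError("address was not reserved")
            self._reserved.discard(addr)
            self._pages[addr] = {"persistent": persistent,
                                 "replicas": replicas,
                                 "data": bytearray(PAGE_SIZE)}

    def access(self, addr):
        """Return the page contents; fails for addresses that are only reserved."""
        if addr not in self._pages:
            raise LookupError("page not materialized")
        return self._pages[addr]["data"]

    def deallocate(self, start, npages):
        """Free pages; their addresses are not handed out again."""
        for addr in range(start, start + npages * PAGE_SIZE, PAGE_SIZE):
            self._pages.pop(addr, None)
```

A caller would reserve a generous range up front and materialize pages one at a time as a file grows, which matches the reserve-then-substantiate behavior described for file extension.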

Addressable Shared Memory Space 

Having described the invention and various embodi-
ments thereof in some detail, a more detailed description is
now provided of the addressable shared memory space that 
is disclosed in the commonly-owned U.S. patent application 
Ser. No. 08/754,481 filed Nov. 22, 1996, and incorporated 
by reference above. All of the information provided below 
is contained in that patent application. 

The addressable shared memory system disclosed in the 
U.S. patent application incorporated by reference is an 
"engine" that can create and manage a virtual memory space 
that can be shared by each computer on a network and can 
span the storage space of each memory device connected to 
the network. Accordingly, all data stored on the network can 
be stored within the virtual memory space and the actual 
physical location of the data can be in any of the memory 
devices connected to the network. 

More specifically, the engine or system can create or
receive a global address signal that represents a portion, for
example 4 kbytes, of the virtual memory space. The global
address signal can be decoupled from, i.e., unrelated to, the 
physical and identifier spaces of the underlying computer 
hardware, to provide support for a memory space large 
enough to span each volatile and persistent memory device 
connected to the system. For example, systems of the 
invention can operate on 32-bit computers, but can employ 
global address signals that can be 128 bits wide. 
Accordingly, the virtual memory space spans 2^128 bytes,
which is much larger than the 2^32 address space supported
by the underlying computer hardware. Such an address
space can be large enough to provide a separate address for
every byte of data storage on the network, including all
RAM, disk and tape storage.

For such a large virtual memory space, typically only a
small portion is storing data at any time. Accordingly, the
system includes a directory manager that tracks those por-
tions of the virtual memory space that are in use. The system
provides physical memory storage for each portion of the
virtual memory space in use by mapping each such portion
to a physical memory device, such as a RAM memory or a
hard-drive. Optionally, the mapping includes a level of
indirection that facilitates data migration, fault-tolerant
operation, and load balancing.

By allowing each computer to monitor and track which
portions of the virtual memory space are in use, each
computer can share the memory space. This allows the
networked computers to appear to have a single memory,
and therefore can allow application programs running on
different computers to communicate using techniques cur-
rently employed to communicate between applications run-
ning on the same machine.

In one aspect, the invention of the above-identified,
incorporated-by-reference U.S. patent application can be
understood to include computer systems having an address-
able shared memory space. The systems can comprise a data
network that carries data signals representative of computer
readable information, a persistent memory device that
couples to the data network and that provides persistent data
storage, and plural computers that each have an interface
that couples to the data network, for accessing the data
network to exchange data signals therewith. Moreover, each
of the computers can include a shared memory subsystem
for mapping a portion of the addressable memory space to
a portion of the persistent storage to provide addressable
persistent storage for data signals.

In a system that distributes the storage across the memory
devices of the network, the persistent memory device will be
understood to include a plurality of local persistent memory
devices, including any disk, RAID, tape or other device that
provides persistent data storage.

The systems can also include a coherent replication
controller for generating a copy, or select number of copies,
of a portion of the addressable memory space maintained in
the local persistent memory device of a first computer and
for storing the copy in the local persistent memory device of
a second computer. The coherent replication controller can
maintain the coherency of the copies to provide coherent
data replication.

The systems can also be understood to provide integrated
control of data stored in volatile memory and in persistent
memory. In such systems a volatile memory device has
volatile storage for data signals, and the shared memory
subsystem includes an element, typically a software module,
for mapping a portion of the addressable memory space to
a portion of the volatile storage. In these systems the volatile
memory device can be comprised of a plurality of local
volatile memory devices each coupled to a respective one of
the plural computers, and the persistent memory device can
be comprised of a plurality of local persistent memory
devices each coupled to a respective one of the plural
computers.

In these systems, a directory manager can track the
mapped portions of the addressable memory space, and can
include two sub-components: a disk directory manager for
tracking portions of the addressable memory space mapped
to the local persistent memory devices, and a RAM directory
manager for tracking portions of the addressable memory
space mapped to the local volatile memory devices.
Optionally, a RAM cache system can operate one of the
local volatile memory devices as a cache memory for cache
storing data signals associated with recently accessed por-
tions of the addressable memory space.

The systems can include additional elements including a
paging element for remapping a portion of the addressable
memory space between one of the local volatile memory
devices and one of the local persistent memory devices; a
policy controller for determining a resource available signal
representative of storage available on each of the plural

devices that each couple to a respective one of the plural computers and, a paging element that remaps the portion of 

computers. To this same end, the system can also include a addressable memory space from a memory device of a first 

distributor for mapping portions of the addressable memory computer to a memory device of a second computer, respon- 

space across the plurality of local persistent memory devices 45 sive to the resource available signal; and a migration con- 

and a disk directory manager for tracking the mapped troller for moving portions of addressable memory space 

portions of the addressable memory space to provide infor- between the local volatile memory devices of the plural 

mation representative of the local persistent memory device computers. 

that stores that portion of the addressable memory space Optionally, the systems can include a hierarchy manager 
mapped thereon. 50 for organizing the plural computers into a set of hierarchical 
The systems can also include a cache system for operating groups wherein each group includes at least one of the plural 
one of the local persistent memory devices as a cache computers. Each the group can include a group memory 
memory for cache storing data signals associated with manager for migrating portions of addressable memory 
recently accessed portions of the addressable memory space. s P ac e as a function of the hierarchical groups. 
Further the system can include a migration controller for 55 The system can maintain coherency between copied por- 
selectively moving portions of the addressable memory tions of the memory space by including a coherent replica- 
space between the local persistent memory devices of the tion controller for generating a coherent copy of a portion of 
plural computers. The migration controller can determine addressable memory space. 

and respond to data access patterns, resource demands or The system can generate or receive global address signals, 

any other criteria or heuristic suitable for practice with the 60 Accordingly the systems can include an address generator 

invention. Accordingly, the migration controller can balance for generating a global address signal representative of a 

the loads on the network, and move data to nodes from portion of addressable memory space. The address generator 

which it is commonly accessed. The cache controller can be can include a spanning unit for generating global address 

a software program running on a host computer to provide signals as a function of a storage capacity associated with the 

a software managed RAM and disk cache. The RAM can be 65 persistent memory devices, to provide global address signals 

any volatile memory including SRAM, DRAM or any other capable of logically addressing the storage capacity of the 

volatile memory. Hie disk can be any persistent memory persistent memory devices. 
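The idea of a spanning unit that derives global addresses from the storage capacity of the persistent memory devices can be pictured with a minimal sketch. It assumes, purely for illustration, that each device contributes one contiguous range of the global address space sized to its capacity. The function names `span_address_space` and `locate`, the device names, and the capacities are hypothetical and do not come from the application.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# (start, end, device): a half-open range [start, end) of global addresses.
Range = Tuple[int, int, str]


def span_address_space(capacities: Dict[str, int]) -> List[Range]:
    """Give each device a contiguous global range sized to its capacity.

    Assumption for this sketch: ranges are laid out back to back in the
    order the devices are listed.
    """
    ranges: List[Range] = []
    start = 0
    for device, size in capacities.items():
        ranges.append((start, start + size, device))
        start += size
    return ranges


def locate(ranges: List[Range], address: int) -> Tuple[str, int]:
    """Translate a global address into (device, offset within that device)."""
    starts = [r[0] for r in ranges]
    i = bisect_right(starts, address) - 1
    if i < 0 or address >= ranges[i][1]:
        raise ValueError(f"address {address} is outside the spanned space")
    start, _end, device = ranges[i]
    return device, address - start


if __name__ == "__main__":
    ranges = span_address_space({"disk_a": 8192, "disk_b": 4096})
    print(ranges)
    print(locate(ranges, 8192))
```

A static layout like this is only the simplest possible case; the text also allows a level of indirection in the mapping, which this sketch does not model.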
In distributed systems, the directory manager can be a distributed directory manager for storing within the distributed memory space, a directory signal representative of a storage location of a portion of the addressable memory space. The distributed directory manager can include a directory page generator for allocating a portion of the addressable memory space and for storing therein an entry signal representative of a portion of the directory signal. The directory page generator optionally includes a range generator for generating a range signal representative of a portion of the addressable memory space, and for generating the entry signal responsive to the range signal, to provide an entry signal representative of a portion of the directory signal that corresponds to the portion of the addressable memory space. Moreover, the distributed directory manager can include a linking system for linking the directory pages to form a hierarchical data structure of the linked directory pages as well as a range linking system for linking the directory pages, as a function of the range signal, to form a hierarchical data structure of linked directory pages.

As the data stored by the system can be homeless, in that the data has no fixed physical home, but can migrate, as resources and other factors dictate, between the memory devices of the network, a computer system according to the invention can include a directory page generator that has a node selector for generating a responsible node signal representative of a select one of the plural computers having location information for a portion of the shared address space. This provides a level of indirection that decouples the directory from the physical storage location of the data. Accordingly, the directory needs only to identify the node, or other device, that tracks the physical location of the data. This way, each time data migrates between physical storage locations, the directory does not have to be updated, since the node tracking the location of the data has not changed and still provides the physical location information.

Accordingly, the system can include page generators that generate directory pages that carry information representative of a location monitor, such as a responsible computer node, that tracks a data storage location, to provide a directory structure for tracking homeless data. Moreover, the directory itself can be stored as pages within the virtual memory space. Therefore, the data storage location can store information representative of a directory page, to store the directory structure as pages of homeless data.

In another aspect, the invention of the above-identified, incorporated-by-reference U.S. patent application can be understood as methods for providing a computer system having an addressable shared memory space. The method can include the steps of providing a network for carrying data signals representative of computer readable information, providing a hard-disk, coupled to the network, and having persistent storage for data signals, providing plural computers, each having an interface, coupled to the data network, for exchanging data signals between the plural computers, and assigning a portion of the addressable memory space to a portion of the persistent storage of the hard disk to provide addressable persistent storage for data signals.

Turning now to the drawings related to the addressable shared memory system or engine of the above-identified, incorporated-by-reference U.S. patent application, FIG. 7 illustrates a computer network 210 that provides a shared memory that spans the memory space of each node of the depicted computer network 210.

Specifically, FIG. 7 illustrates a computer network 210 that includes a plurality of nodes 212a-212c, each having a CPU 214, an operating system 216, an optional private memory device 218, and a shared memory subsystem 220. As further depicted in FIG. 5, each node 212a-212c connects via the shared memory subsystem 220 to a virtual shared memory 222. As will be explained in greater detail hereinafter, by providing the shared memory subsystem 220 that allows the node 212a-212c to access the virtual shared memory 222, the computer network 210 enables network nodes 212a-212c to communicate and share functionality using the same techniques employed by applications when communicating between applications running on the same machine. These techniques can employ object linking and embedding, dynamic link libraries, class registering, and other such techniques. Accordingly, the nodes 212 can employ the virtual shared memory 222 to exchange data and objects between application programs running on the different nodes 212 of the network 210.

In the embodiment depicted in FIG. 7, each node 212 can be a conventional computer system such as a commercially available IBM PC compatible computer system. The processor 214 can be any processor unit suitable for performing the data processing for that computer system. The operating system 216 can be any commercially available or proprietary operating system that includes, or can access, functions for accessing the local memory of the computer system and networking.

The private memory device 218 can be any computer memory device suitable for storing data signals representative of computer readable information. The private memory provides the node with local storage that can be kept inaccessible to the other nodes on the network. Typically the private memory device 218 includes a RAM, or a portion of a RAM memory, for temporarily storing data and application programs and for providing the processor 214 with memory storage for executing programs. The private memory device 218 can also include persistent memory storage, typically a hard disk unit or a portion of a hard disk unit, for the persistent storage of data.

The shared memory subsystem 220 depicted in FIG. 7 is an embodiment of the invention that couples between the operating system 216 and the virtual shared memory 222 and forms an interface between the operating system 216 and the virtual shared memory to allow the operating system 216 to access the virtual shared memory 222. The depicted shared memory subsystem 220 is a software module that operates as a stand-alone distributed shared memory engine. The depicted system is illustrative and other systems of the invention can be realized as shared memory subsystems that can be embedded into an application program, or be implemented as an embedded code of a hardware device. Other such applications can be practiced without departing from the scope of the invention.

The depicted virtual shared memory 222 illustrates a virtual shared memory that is accessible by each of the nodes 212a-212c via the shared memory subsystem 220. The virtual shared memory 222 can map to devices that provide physical storage for computer readable data, depicted in FIG. 7 as a plurality of pages 224a-224d. In one embodiment, the pages form portions of the shared memory space and divide the address space of the shared memory into page addressable memory spaces. For example, the address space can be paged into 4K byte sections. In other embodiments alternative granularity can be employed to manage the shared memory space. Each node 212a-212c through the shared memory subsystem 220 can access each page 224a-224d stored in the virtual shared memory 222. Each page 224a-224d represents a unique entry of computer
data stored within the virtual shared memory 222. Each page 224a-224d is accessible to each one of the nodes 212a-212c, and alternatively, each node can store additional pages of data within the virtual shared memory 222. Each newly stored page of data can be accessible to each of the other nodes 212a-212c. Accordingly, the virtual shared memory 222 provides a system for sharing and communicating data between each node 212 of the computer network 210.

FIG. 8 illustrates in functional block diagram form a computer network 230 that has a distributed shared memory. In this embodiment, each node 212a-212c has a memory subsystem 232 that connects between the operating system 216 and the two local memory devices, the RAM 234 and the disk 236, and that further couples to a network 238 that couples to each of the depicted nodes 212a, 212b and 212c and to a network memory device 226.

More particularly, FIG. 8 illustrates a distributed shared memory network 230 that includes a plurality of nodes 212a-212c, each including a processing unit 214, an operating system 216, a memory subsystem 232, a RAM 234, and a disk 236. FIG. 8 further depicts a computer network system 238 that connects between the nodes 212a-212c and the network memory device 226. The network 238 provides a network communication system across these elements.

The illustrated memory subsystems 232a-232c that connect between the operating system 216a-216c, the memory elements 234a-234c, 236a-236c, and the network 238, encapsulate the local memories of each of the nodes to provide an abstraction of a shared virtual memory system that spans across each of the nodes 212a-212c on the network 238. The memory subsystems 232a-232c can be software modules that act as distributors to map portions of the addressable memory space across the depicted memory devices. The memory subsystems further track the data stored in the local memory of each node 212 and further operate network connections with network 238 for transferring data between the nodes 212a-212c. In this way, the memory subsystems 232a-232c access and control each memory element on the network 238 to perform memory access operations that are transparent to the operating system 216. Accordingly, the operating system 216 interfaces with the memory subsystem 232 as an interface to a global memory space that spans each node 212a-212c on the network 238.

FIG. 8 further depicts that the system 230 provides a distributed shared memory that includes persistent storage for portions of the distributed memory. In particular, the depicted embodiment includes a memory subsystem, such as subsystem 232a, that interfaces to a persistent memory device, depicted as the disk 236a. The subsystem 232a can operate the persistent memory device to provide persistent storage for portions of the distributed shared memory space.

As illustrated, each persistent memory device 236 depicted in FIG. 8 has a portion of the addressable memory space mapped onto it. For example, device 236a has the portions of the addressable memory space, Co, Cd, mapped onto it, and provides persistent storage for data signals stored in those ranges of addresses.

Accordingly, the subsystem 232a can provide integrated control of persistent storage devices and electronic memory to allow the distributed shared memory space to span across both types of storage devices, and to allow portions of the distributed shared memory to move between persistent and electronic memory depending on predetermined conditions, such as recent usage.

Referring to the embodiment depicted in FIG. 2, the nodes of the network may be organized into a hierarchy of groups. In these embodiments, the memory subsystems 232a-232c can include a hierarchy manager that provides hierarchical control for the distribution of data. This includes controlling the migration controller, and policy controller, which are discussed in detail below, to perform hierarchical data migration and load balancing such that data migrates primarily between computers of the same group, and passes to other groups in hierarchical order. Resource distribution is similarly managed.

FIG. 9 illustrates in more detail one shared memory subsystem 240 according to the invention. FIG. 9 depicts a shared memory subsystem 240 that includes an interface 242, a DSM directory manager 244, a memory controller 246, a local disk cache controller 248, and a local RAM cache controller 250. FIG. 9 further depicts the network 254, an optional consumer of the DSM system, depicted as the service 258, the operating system 216, a disk driver 260, a disk element 262 and a RAM element 264.

The shared memory subsystem 240 depicted in FIG. 9 can encapsulate the memory management operations of the network node 212 to provide a virtual shared memory that can span across each node that connects into the network 254. Accordingly, each local node 212 views the network as a set of nodes that are each connected to a large shared computer memory.

The depicted interface 242 provides an entry point for the local node to access the shared memory space of the computer network. The interface 242 can couple directly to the operating system 216, to a distributed service utility such as the depicted DSM file system 258, to a distributed user-level service utility, or alternatively to any combination thereof.

The depicted interface 242 provides an API that is a memory oriented API. Thus, the illustrated interface 242 can export a set of interfaces that provide low-level control of the distributed memory. As illustrated in FIG. 9, the interface 242 exports the API to the operating system 216 or to the optional DSM service 258. The operating system 216 or the service employs the interface 242 to request standard memory management techniques, such as reading and writing from portions of the memory space. These portions of the memory space can be the pages as described above, which can be 4K byte portions of the shared memory space, or other units of memory, such as objects or segments. Each page can be located within the shared memory space which is designated by a global address signal for that page of memory. The system can receive address signals from an application program or, optionally, can include a global address generator that generates the address signals. The address generator can include a spanning module that generates address signals for a memory space that spans the storage capacity of the network.

Accordingly, in one embodiment, the interface 242 receives requests to manipulate pages of the shared memory space. To this end, the interface 242 can comprise a software module that includes a library of functions that can be called by services, the OS 216, or other caller, or device. The function calls provide the OS 216 with an API of high level memory oriented services, such as read data, write data, and allocate memory. The implementation of the functions can include a set of calls to controls that operate the directory manager 244, and the local memory controller 246. Accordingly, the interface 242 can be a set of high level memory function calls to interface to the low-level functional elements of shared memory subsystem 240.
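The description of the interface 242 as a library of high level memory functions (read data, write data, allocate memory) implemented through calls to the directory manager 244 and the local memory controller 246 can be illustrated with a small single-node sketch. Everything named here is a hypothetical stand-in: the classes `DirectoryManager`, `MemoryController` and `SharedMemoryInterface`, their method names, and the restriction that a single call stays within one 4K page are assumptions of the sketch, not details taken from the specification.

```python
PAGE_SIZE = 4096  # the text gives 4K byte pages as one possible granularity


class DirectoryManager:
    """Hypothetical stand-in for directory manager 244: tracks pages in use."""

    def __init__(self) -> None:
        self._allocated = set()
        self._next_page = 0

    def allocate(self) -> int:
        """Reserve the next free page and return its global address."""
        address = self._next_page * PAGE_SIZE
        self._allocated.add(address)
        self._next_page += 1
        return address

    def is_allocated(self, page_address: int) -> bool:
        return page_address in self._allocated


class MemoryController:
    """Hypothetical stand-in for memory controller 246: holds page contents."""

    def __init__(self) -> None:
        self._pages = {}

    def load(self, page_address: int) -> bytearray:
        # Pages are created zero-filled on first touch.
        return self._pages.setdefault(page_address, bytearray(PAGE_SIZE))


class SharedMemoryInterface:
    """High level calls that delegate to the two lower-level elements."""

    def __init__(self, directory: DirectoryManager, controller: MemoryController) -> None:
        self._directory = directory
        self._controller = controller

    def allocate(self) -> int:
        return self._directory.allocate()

    def _page(self, address: int, length: int):
        offset = address % PAGE_SIZE
        page_address = address - offset
        if not self._directory.is_allocated(page_address):
            raise KeyError(f"page at {page_address} is not allocated")
        if offset + length > PAGE_SIZE:
            raise ValueError("sketch limitation: access must stay within one page")
        return self._controller.load(page_address), offset

    def write(self, address: int, data: bytes) -> None:
        page, offset = self._page(address, len(data))
        page[offset:offset + len(data)] = data

    def read(self, address: int, length: int) -> bytes:
        page, offset = self._page(address, length)
        return bytes(page[offset:offset + length])


if __name__ == "__main__":
    iface = SharedMemoryInterface(DirectoryManager(), MemoryController())
    page = iface.allocate()
    iface.write(page + 10, b"hello")
    print(iface.read(page + 10, 5))
```

Keeping the caller-facing functions thin, with the bookkeeping in one object and the storage in another, mirrors the split the text draws between the directory manager and the memory controller.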
FIG. 9 further depicts a DSM directory manager 244 that couples to the interface 242. The interface 242 passes request signals that represent requests to implement memory operations such as allocating a portion of memory, locking a portion of memory, mapping a portion of memory, or some other such memory function. The directory manager 244 manages a directory that can include mappings that can span across each memory device connected to the network 238 depicted in FIG. 8, including each RAM and disk element accessible by the network. The directory manager 244 stores a global directory structure that provides a map of the global address space. In one embodiment, as will be explained in greater detail hereinafter, the directory manager 244 provides a global directory that maps between global address signals and responsible nodes on the network. A responsible node stores information regarding the location and attributes of data associated with a respective global address, and optionally stores a copy of that page's data. Consequently, the directory manager 244 tracks information for accessing any address location within the identifier space.

The control of the distributed shared memory can be coordinated by the directory manager 244 and the memory controller 246. The directory manager 244 maintains a directory structure that can operate on a global address received from the interface 242 and identify, for that address, a node on the network that is responsible for maintaining the page associated with that address of the shared memory space. Once the directory manager 244 identifies which node is responsible for maintaining a particular address, the directory manager 244 can identify a node that stores information for locating a copy of the page, and make the call to the memory controller 246 of that node and pass to that node's memory controller the memory request provided by the memory interface 242. Accordingly, the depicted directory manager 244 is responsible for managing a directory structure that identifies for each page of the shared memory space a responsible node that tracks the physical location of the data stored in the respective page. Thus, the directory, rather than directly providing the location of the page, can optionally identify a responsible node, or other device, that tracks the location of the page. This indirection facilitates maintenance of the directory as pages migrate between nodes.

The memory controller 246 performs the low level memory access functions that physically store data within the memory elements connected to the network. In the depicted embodiment, the directory manager 244 of a first node can pass a memory access request through the interface 242, to the network module of the OS 216, and across the network 254 to a second node that the directory manager 244 identifies as the responsible node for the given address. The directory manager 244 can then query the responsible node to determine the attributes and the current owner node of the memory page that is associated with the respective global address. The owner of the respective page is the network node that has control over the memory storage element on which the data of the associated page is stored. The memory controller 246 of the owner can access, through the OS 216 of that node or through any interface, the memory of the owner node to access the data of the page that is physically stored on that owner node.

In particular, as depicted in FIG. 9, the directory manager 244 couples to the network module 252 which couples to the network 254. The directory manager can transmit to the network module 252 a command and associated data that directs the network interface 252 to pass a data signal to the owner node. The owner node receives the memory request across network 254 and through network module 252 that passes the memory request to the interface 242 of that owner node. The interface 242 couples to the memory controller 246 and can pass the memory request to the local memory controller of that owner node for operating the local storage elements, such as the disk or RAM elements, to perform the requested memory operation.

Once the owner node has performed the requested memory operation, such as reading a page of data, the memory subsystem 240 of the owner node can then transfer the page of data, or a copy of the page of data, via the network 254 to the node that originally requested access to that portion of the shared memory. The page of data is transferred via the network 254 to the network module 252 of the requesting node and the shared memory subsystem 240 operates the memory controller 246 to store in the local memory of the requesting node a copy of the accessed data.

Accordingly, in one embodiment of the invention, when a first node accesses a page of the shared memory space which is not stored locally on that node, the directory manager 244 identifies a node that has a copy of the data stored in that page and moves a copy of that data into the local memory of the requesting node. The local memory storage, both volatile (e.g. local RAM) and persistent (e.g. local disk storage), of the requesting node therefore becomes a cache for pages that have been requested by that local node. This embodiment is depicted in FIG. 9, which depicts a memory controller that has a local disk cache controller 248 and a local RAM cache controller 250. Both of these local cache controllers can provide to the operating system 216, or other consumer, pages of the shared memory space that are cache stored in the local memory of the node, including local persistent memory and local volatile memory.

The shared memory subsystem can include a coherent replication controller that maintains coherency between cached pages by employing a coherence through invalidation process, a coherence through migration process or other coherence process suitable for practice with the present invention. The coherent replication controller can automatically generate a copy of the data stored in each page and can store the copy in a memory device that is separate from the memory device of the original copy. This provides for fault tolerant operation, as the failure of any one memory device will not result in the loss of data. The coherent replication controller can be a software model that monitors all copies of pages kept in volatile memory and made available for writing. The controller can employ any of the coherency techniques named above, and can store tables of location information that identifies the location information for all generated copies.

FIG. 10 illustrates in greater detail one embodiment of a shared memory subsystem according to the invention. The shared memory subsystem 270 depicted in FIG. 10 includes a remote operations element 274, a local RAM cache 276, a RAM copyset 278, a global RAM directory 280, a disk copyset 282, a global disk directory 284, a configuration manager 288, a policy element 290, and a local disk cache 294. FIG. 10 further depicts a network element 304, a physical memory 300, shared data element 302, a physical file system 298, which is part of the operating system 216, a configuration service 308, a diagnostic service 310, and a memory access request 312. The depicted subsystem 270 can be a computer program that couples to the physical memory, file system, and network system of the host node, or can be electrical circuit card assemblies that interface to the host node, or can be a combination of programs and circuit card assemblies.
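The lookup path described above for the directory manager 244 (global address to responsible node, responsible node to current owner node, then a copy cached at the requesting node) can be sketched in a few lines. This is a single-process toy under stated assumptions: the `Node` and `Cluster` classes and their fields are hypothetical, network transfer is replaced by direct dictionary access, and coherency of cached copies is ignored because the sketch performs no writes after a page is created.

```python
class Node:
    """Hypothetical network node with owned pages, cached copies, and
    location records for the pages it is responsible for."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store = {}     # pages this node owns: address -> bytes
        self.cache = {}     # copies of pages fetched from other nodes
        self.owner_of = {}  # pages it is responsible for: address -> owner Node


class Cluster:
    def __init__(self) -> None:
        # Global directory: page address -> responsible node (not the owner).
        self.directory = {}

    def create_page(self, address: int, data: bytes, responsible: Node, owner: Node) -> None:
        self.directory[address] = responsible
        responsible.owner_of[address] = owner
        owner.store[address] = data

    def read(self, requester: Node, address: int) -> bytes:
        # Serve locally when the page is owned or already cached here.
        if address in requester.store:
            return requester.store[address]
        if address in requester.cache:
            return requester.cache[address]
        responsible = self.directory[address]    # step 1: who tracks it?
        owner = responsible.owner_of[address]    # step 2: who holds it now?
        data = owner.store[address]              # step 3: fetch from the owner
        requester.cache[address] = data          # local memory acts as a cache
        return data

    def migrate(self, address: int, new_owner: Node) -> None:
        """Move a page; only the responsible node's record changes."""
        responsible = self.directory[address]
        old_owner = responsible.owner_of[address]
        new_owner.store[address] = old_owner.store.pop(address)
        new_owner.cache.pop(address, None)  # the owned copy supersedes a cached one
        responsible.owner_of[address] = new_owner


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    cluster = Cluster()
    cluster.create_page(0x1000, b"page data", responsible=a, owner=b)
    print(cluster.read(c, 0x1000))
    cluster.migrate(0x1000, c)
    print(cluster.directory[0x1000].name)
```

Note that `migrate` never touches `Cluster.directory`; that is the point of the indirection, since the responsible node for an address stays fixed while the page moves.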
The flow scheduler 272 depicted in FIG. 10 can orchestrate the controls provided by an API of the subsystem 270. In one embodiment, the flow scheduler 272 can be a state machine that monitors and responds to the requests 312 and remote requests through network 304 which can be instructions for memory operations and which can include signals representative of the global addresses being operated on. These memory operation requests 312 can act as op-codes for primitive operations on one or more global addresses. They can be read and write requests or other memory operations. Alternatively, the flow scheduler 272 can be a program, such as an interpreter, that provides an execution environment and can map these op-codes into control flow programs called applets. The applets can be independent executable programs that employ both environment services, such as threading, synchronization, and buffer management, and the elements depicted in FIG. 10. The API is capable of being called from both external clients, like a distributed shared memory file system, as well as recursively by the applets and the other elements 274-294 of the subsystem 270. Each element can provide a level of encapsulation to the management of a particular resource or aspect of the system. To this end, each element can export an API consisting of functions to be employed by the applets. This structure is illustrated in FIG. 10. Accordingly, the flow scheduler 272 can provide an environment to load and execute applets. The applets are dispatched by the flow scheduler 272 on a per op-code basis and can perform the control flow for sequential or parallel execution of an element to implement the op-code on the specified global address, such as a read or write operation. Optionally, the flow scheduler 272 can include an element to change dynamically the applet at run time as well as execute applets in parallel and in interpreted mode.

The depicted shared memory subsystem 270 includes a bifurcated directory manager that includes the global RAM directory 280 and the global disk directory 284. The global RAM directory 280 is a directory manager that tracks information that can provide the location of pages that are stored in the volatile memory, typically RAM, of the network nodes. The global disk directory 284 is a global disk directory manager that manages a directory structure that tracks information that can provide the location of pages that are stored on persistent memory devices. Together, the global RAM directory 280 and the global disk directory 284 provide the shared memory subsystem 270 with integrated directory management for pages that are stored in persistent storage and volatile memory.

In one embodiment a paging element can operate the RAM and disk directory managers to remap portions of the addressable memory space between one of the volatile memories and one of the persistent memories. In the shared memory system, this allows the paging element to remap

store copies on pages or multiple nodes. The global disk directory 284 maps address ranges to nodes that are responsible for managing the pages within each range. The nodes responsible for a range of addresses will be referred to as the "core holders" of those pages.

Each page can be assigned a minimum number of core holders below which it should not fall. For example, if a page is assigned three as its minimum number of core holders and the third core holder suffers a failure which prevents access to the page, the page has fallen below its preferred minimum number of core holders and another copy of the page should be made. Another copy of the page may be made on the core holder node (if it did not completely fail) or a new core holder may be appointed and a copy of the page given to it by one of the surviving core holders. As alluded to above, reduplexing is also used to recover from complete node failure, and reduplexing after a node failure will be described in greater detail below.

The local memory controller of the subsystem 270 is provided by the local RAM cache 276 and the local disk cache 294. The local RAM cache 276, which couples to the physical memory 300 of the local node, can access, as described above, the virtual memory space of the local node to access data that is physically stored within the RAM memory 300. Similarly, the local disk cache 294 couples to the persistent storage device 298 and can access a physical location that maintains in the local persistent storage data of the distributed shared memory.

FIG. 10 also depicts a remote operations element 274 that couples between the network 304 and the flow scheduler 272. The remote operations element 274 negotiates the transfer of data across the network 304 for moving portions of the data stored in the shared memory space between the nodes of the network. The remote operations element 274 can also request services from remote peers, i.e., invalidate to help maintain coherency or for other reasons.

FIG. 10 also depicts a policy element 290 that can be a software module that acts as a controller to determine the availability of resources, such as printer capabilities, hard-disk space, available RAM and other such resources. The policy controller can employ any of the suitable heuristics to direct the elements, such as the paging controller, disk directory manager, and other elements to dynamically distribute the available resources.

FIG. 10 further depicts a memory subsystem 270 that includes a RAM copyset 278 and a disk copyset 282. These copysets can manage copies of pages that are cached at a single node. The disk copyset 282 can maintain information on copies of pages that are stored in the local disk cache, which can be the local persistent memory. Similarly, the RAM copyset 278 can maintain information on copies of pages that are stored in the local RAM cache which can be

pages from the volatile memory of one node to a disk the local RAM. These copysets encapsulate indexing and 

memory of another node. Accordingly, the RAM directory 55 storage of copyset data that can be employed by applets or 

manager passes control of that page to the disk directory other executing code for purposes of maintaining the coher- 

manager which can then treat the page as any other page of ency of data stored in the shared memory space. The copyset 

data. This allows for improved load balancing, by removing elements can maintain copyset data that identifies the pages 

data from RAM memory, and storing it in the disk devices, cached by the host node. Further, the copyset can identify 

under the control of the disk directory manager. eo the other nodes on the network that maintain a copy of that 

Data may be stored in the Ram memory of more than one page, and can further identify for each page which of these 

node, the persistent memory of more than one node, or some nodes is the owner node, wherein the owner node can be a 

combination of RAM and persistent memory distributed node which has write privileges to the page being accessed, 

throughout the network. This natural distribution of data The copysets themselves can be stored in pages of the 

present in the system provides a first line of defense against 65 distributed shared memory space. 

node failures. In addition to the natural distribution of data, The local RAM cache 276 provides storage for memory 

the system may "duplex" pages of data, i.e., the system may pages and their attributes. In one embodiment, the local 
RAM cache 276 provides a global address index for
accessing the cached pages of the distributed memory and the
attributes based on that page. In this embodiment, the local
RAM cache 276 provides the index by storing in memory a list
of each global address cached in the local RAM. With each
listed global address, the index provides a pointer into a
buffer memory and to the location of the page data.
Optionally, with each listed global address, the index can
further provide attribute information including a version tag
representative of the version of the data, a dirty bit
representative of whether the RAM cached data is a copy of the
data held on disk, or whether the RAM cached data has been
modified but not yet flushed to disk, a volatile bit to indicate
if the page is backed by backing store in persistent memory,
and other such attribute information useful for managing the
coherency of the stored data.

In the embodiment depicted in FIG. 10, the memory
subsystem 270 provides the node access to the distributed
memory space by the coordinated operation of the directory
manager that includes the global RAM directory 280 and the
global disk directory 284, the cache controller that includes
the local RAM cache and the local disk cache elements 276
and 294, and the copyset elements which include the RAM
copyset 278 and the disk copyset 282.

The directory manager provides a directory structure that
indexes the shared address space. Continuing with the
example of a paged shared address space, the directory
manager of the subsystem 270 allows the host node to
access, by global addresses, pages of the shared memory
space.

FIGS. 11 and 12 illustrate one example of a directory
structure that provides access to the shared memory space.
FIG. 11 depicts a directory page 320 that includes a page
header 322, directory entries 324 and 326, wherein each
directory entry includes a range field 330, a responsible node
field 332, and an address field 334. The directory pages can
be generated by a directory page generator that can be a
software module controlled by the directory manager. It will
be understood that the directory manager can generate
multiple directories, including one for the Global disk and
one for the Global RAM directories. The depicted directory
page 320 can be a page of the global address space, such as
a 4K byte portion of the shared address space. Therefore, the
directory page can be stored in the distributed shared
memory space just as the other pages to which the directory
pages provide access.

As further depicted in FIG. 11, each directory page 320
includes a page header 322 that includes attribute
information for that page header, which is typically metadata for the
directory page, and further includes directory entries such as
the depicted directory entries, 324 and 326, which provide
an index into a portion of the shared address space wherein
that portion can be one or more pages, including all the
pages of the distributed shared memory space. The depicted
directory page 320 includes directory entries that index a
selected range of global addresses of the shared memory
space. To this end, the directory generator can include a
range generator so that each directory entry can include a
range field 330 that describes the start of a range of
addresses that that entry locates.

Accordingly, each directory page 320 can include a
plurality of directory entries, such as entries 324 and 326, that
can subdivide the address space into a subset of address
ranges. For example, the depicted directory page 320
includes two directory entries 324 and 326. The directory
entries 324 and 326 can, for example, subdivide the address
space into two sub-portions. In this example, the start
address range of the directory entry 324 could be the base
address of the address space, and the start address range of
the directory entry 326 could be the address for the upper
half of the memory space. Accordingly, the directory entry
324 provides an index for pages stored in the address space
between the base address and up to the mid-point of the
memory space and, in complement thereto, the directory
entry 326 provides an index to pages stored in the address
space that ranges from the mid-point of the address space to
the highest address.

FIG. 11 further depicts a directory page 320 that includes,
in each directory entry, a responsible node field 332 and the
child page global address field 334. These fields 332, 334
provide further location information for the data stored in
pages within the address range identified in field 330.

FIG. 12 depicts a directory 340 formed from directory
pages similar to those depicted in FIG. 9. FIG. 12 depicts
that the directory 340 includes directory pages 342,
350-354 and 360-366. FIG. 12 further depicts that the
directory 340 provides location information to the pages of
the distributed shared memory space depicted in FIG. 12 as
pages 370-384.

The directory page 342 depicted in FIG. 12 acts like a root
directory page and can be located at a static address that is
known to each node coupled to the distributed address space.
The root directory page 342 includes three directory entries
344, 346, and 348. Each directory entry depicted in FIG. 12
has directory entries similar to those depicted in FIG. 11. For
example, directory entry 344 includes a variable Co which
represents the address range field 330, a variable Nj
representative of the field 332, and a variable Cs representative of
the field 334. The depicted root directory page 342
subdivides the address space into three ranges illustrated as an
address range that extends between the address Co and Cd,
a second address range that extends between the address Cd
and Cg, and a third address range that extends between Cg
and the highest memory location of the address space.

As further depicted in FIG. 12, each directory entry 344,
346, and 348 points to a subordinate directory page, depicted
as directory pages 350, 352, and 354, each of which further
subdivides the address range indexed by the associated
directory entry of the root directory 342. In FIG. 12, this
subdivision process continues as each of the directory pages
350, 352, and 354 again has directory entries that
locate subordinate directory pages including the depicted
examples of directory pages 360, 362, 364, and 366.

The depicted example of directory pages 360, 362, 364,
and 366 are each leaf entries. The leaf entries contain
directory entries such as the directory entries 356 and 358 of
the leaf entry 360, that store a range field 330 and the
responsible node field 332. These leaf entries identify an
address and a responsible node for the page in the distributed
memory space that is being accessed, such as the depicted
pages 370-384. For example, as depicted in FIG. 12, the leaf
entry 356 points to the page 370 that corresponds to the
range field 330 of the leaf entry 356, which for a leaf entry
is the page being accessed. In this way, the directory
structure 340 provides location information for pages stored
in the distributed address space.

In the depicted embodiment of FIG. 12, a node selector
can select a responsible node for each page, as described
above, so that the leaf entry 356 provides information of the
address and responsible node of the page being located.
Accordingly, this directory tracks ownership and
responsibility for data, to provide a level of indirection between the
directory and the physical location of the data. During a
memory access operation, the memory subsystem 270
passes to the responsible node indicated in the leaf entry 356
the address of the page being accessed. The shared memory
subsystem of that node can identify a node that stores a copy
of the page being accessed, including the owner node. This
identification of a node having a copy can be performed by
the RAM copyset or disk copyset of the responsible node.
The node having a copy stored in its local physical memory,
such as the owner node, can employ its local cache elements,
including the local RAM cache and local disk cache, to
identify from the global address signal a physical location of
the data stored in the page being accessed. The cache
element can employ the operating system of the owner node
to access the memory device that maintains that physical
location in order that the data stored in the page can be
accessed. For a read-memory operation, or for other similar
operations, the data read from the physical memory of the
owner node can be passed via the network to the memory
subsystem of the node requesting the read and subsequently
stored into the virtual memory space of the requesting node
for use by that node.

With reference again to FIG. 12, it can be seen that the
depicted directory structure 340 comprises a hierarchical
structure. To this end, the directory structure 340 provides a
structure that continually subdivides the memory space into
smaller and smaller sections. Further, each section is
represented by directory pages of the same structure, but indexes
address spaces of different sizes. As pages are created or
deleted, a linker inserts or deletes the pages from the
directory. In one embodiment, the linker is a software
module for linking data structures. The linker can operate
responsive to the address ranges to provide the depicted
hierarchical structure. Accordingly, the depicted directory
340 provides a scaleable directory for the shared address
space. Moreover, the directory pages are stored in the
distributed address space and maintained by the distributed
shared memory system. A root for the directory can be stored
in known locations to allow for bootstrap of the system.
Consequently, commonly used pages are copied and
distributed, and rarely used pages are shuffled off to disk.
Similarly, directory pages will migrate to those nodes that
access them most, providing a degree of self-organization
that reduces network traffic.

FIG. 13 depicts the directory of FIG. 12 being employed
by a system according to the invention. In particular FIG. 13
depicts a system 400 that includes two nodes, 406a and
406b, a directory structure 340, and a pair of local memories
having volatile memory devices 264a and 264b, and
persistent memory devices 262a and 262b. Depicted node 406a
includes an address consumer 408a, a global address 410a,
an interface 242a, a directory manager 244a and a memory
controller 246a. Node 406b has corresponding elements.
The nodes are connected by the network 254. The directory
340 has a root page, directory pages A-F, and pages 1-5.

Each node 406a and 406b operates as discussed above.
The depicted address consumers 408a and 408b can be an
application program, file system, hardware device or any
other such element that requests access to the virtual
memory. In operation, the address consumers 408a and 408b
request an address, or range of addresses, and the directory
manager can include a global address generator that
provides the consumer with the requested address, or a pointer
to the requested address. As addresses get generated, the
respective directory managers 244a and 244b generate
directory pages and store the pages in the directory structure
340. As depicted, the directory structure 340 tracks the
portions of the address space being employed by the system
400, and physical storage for each page is provided within
the local memories.

As shown in FIG. 13, the data associated with the
directory pages are distributively stored across the two local
memories and duplicate copies can exist. As described
above and now illustrated in FIG. 13, the data can move
between different local memories and also move, or page,
between volatile and persistent storage. The data movement
can be responsive to data requests made by memory users
like application programs, or by operation of the migration
controller described above. As also described above, the
movement of data between different memory locations can
occur without requiring changes to the directory 340. This is
achieved by providing a directory 340 that is decoupled
from the physical location of the data by employing a pointer
to a responsible node that tracks the data storage location.
Accordingly, although the data storage location can change,
the responsible node can remain constant, thereby avoiding
any need to change the directory 340.

RECOVERY

The system and methods described above allow a
distributed system to share address space, including persistent
storage for memory, and gracefully handle node failure.
Since the RAM directory, disk directory, and file system are
distributed over every node in the distributed shared system,
failure of one node may leave a "hole" in the RAM directory,
disk directory, file system, or some combination of the three.

The systems described throughout rely on two concepts to
aid memory sharing and fault tolerance. Those concepts are
quorum and heartbeat. Before describing quorum or
heartbeat, however, the concept of an anchor node must be
introduced.

Anchor nodes are special network nodes that retain a copy
of the entire node directory database and may provide
storage for other important system information such as
copies of the root of disk directory trees. A node is
configured as an anchor node when it is introduced to the network
and this may be done by setting a value in a configuration file
present on the node, or the node may be configured as an
anchor using hardware techniques such as jumpers or special
cabling. Anchor nodes may also store a complete list of all
other anchor nodes in the network. Each anchor node may
be provided with a list of all other anchor nodes by the
system administrator or, on initialization, each anchor node
may use a search protocol to locate other anchor nodes.

Quorum indicates that enough nodes remain functional in 
the system to provide proper data processing and memory 
sharing. Because the number of nodes present in a network 
may be very high, not all nodes participate in the compu- 
tation of quorum. In order to reduce processing 
requirements, only "anchor" nodes participate in the quorum 
computation. In attempting to establish quorum, each anchor 
node may contribute one "vote." If the number of votes
received is in excess of some predetermined threshold, then 
quorum is established and normal processing is effected. 
Quorum may also be used to gracefully operate when 
network failures result in the partitioning of the network into 
two or more regions. One of the partitions may continue to 
operate (because it is able to establish quorum) while the 
others cannot continue operation. In some embodiments the 
network administrator may assign more than one vote to 
certain anchor nodes in an attempt to bias operation of the 
network towards certain nodes. 
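
The vote counting described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `has_quorum`, the anchor identifiers, the vote weights, and the choice of a simple-majority threshold are all assumptions made for the example.

```python
from typing import Dict, Iterable


def has_quorum(votes: Dict[str, int], reachable: Iterable[str], threshold: int) -> bool:
    """Return True when the reachable anchor nodes hold more votes than the threshold."""
    reachable_set = set(reachable)
    total = sum(weight for anchor, weight in votes.items() if anchor in reachable_set)
    return total > threshold


# Hypothetical anchors; the administrator has given anchor "a" two votes.
VOTES = {"a": 2, "b": 1, "c": 1, "d": 1}
# Assumed threshold: strictly more than half of all configured votes.
MAJORITY_THRESHOLD = sum(VOTES.values()) // 2

# A network failure splits the anchors into two partitions.
partition_one = ["a", "b"]  # holds 3 of 5 votes
partition_two = ["c", "d"]  # holds 2 of 5 votes
```

With a majority threshold, at most one partition can exceed it, which is why only one side of a split continues to operate. Giving "a" an extra vote biases the outcome toward the partition that contains it.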

"Heartbeat" refers to the periodic exchange of connectiv- 
ity information between all nodes of the network. One node 
is assigned to monitor heartbeat information. Heartbeat
monitors may be assigned on a per network, per partition, or 
per group basis. The identity of the heartbeat monitor is 
dynamically assigned and may, but is not required to, favor 
selection of anchor nodes as the heartbeat monitor. All other 
nodes connected to the network are heartbeat "slaves," which
means that those nodes report their operating status to the 
heartbeat monitor and receive periodic connectivity updates 
from the monitor. 

Heartbeat information propagates in the following man- 
ner. Each heartbeat slave periodically transmits a member 
pulse to its local heartbeat monitor indicating to the monitor 
that it is functional. When the monitor receives the slave's
member pulse, it updates its connectivity information. The 
monitor may store connectivity information as a bitmap, or 
any other data structure which allows such information to be 
stored and transmitted. The monitor periodically broadcasts
the compiled connectivity information to the heartbeat
slaves; this broadcast will be referred to as a "monitor pulse."

In the event that a heartbeat slave misses a deadline three 
times in a row for transmitting member pulse information, 
the heartbeat monitor assumes that the errant slave has 
ceased functioning and updates the stored connectivity 
information to reflect the change in status. Each surviving 
heartbeat slave is notified of the change in connectivity on 
the next monitor pulse. In the event the slave is unable to 
transmit information but can receive, the slave will receive 
the broadcasted notification that it is no longer part of the 
network. 
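
The three-missed-deadlines rule can be sketched as a small bookkeeping class. This is an illustrative sketch only: the class name `HeartbeatMonitor`, the method `end_of_period`, and the use of a dictionary in place of the bitmap are assumptions. The sketch also assumes that a slave whose pulse is heard again is marked connected, which the text does not specify.

```python
MISSED_LIMIT = 3  # consecutive missed deadlines before a slave is presumed failed


class HeartbeatMonitor:
    """Tracks member pulses and compiles connectivity information."""

    def __init__(self, slaves):
        self.missed = {slave: 0 for slave in slaves}
        self.connected = {slave: True for slave in slaves}

    def end_of_period(self, pulses_received):
        """Call at each member-pulse deadline with the set of slaves heard from.

        Returns the connectivity map that the next monitor pulse would carry.
        """
        for slave in self.missed:
            if slave in pulses_received:
                self.missed[slave] = 0
                self.connected[slave] = True  # assumption: a returning slave rejoins
            else:
                self.missed[slave] += 1
                if self.missed[slave] >= MISSED_LIMIT:
                    self.connected[slave] = False
        return dict(self.connected)
```

A slave that can still receive would find itself marked as disconnected in the returned map, which corresponds to the broadcast notification described above.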

Should the heartbeat monitor miss a deadline three times 
in a row for broadcasting the monitor pulse, each slave 
assumes that the heartbeat monitor has ceased functioning 
and each slave attempts to become the new heartbeat moni- 
tor. A slave may arbitrate to become the heartbeat monitor or 
a configuration file may be created that lists heartbeat 
monitors in order of preference and from which successive 
monitors may be selected. 

Each node's responsibilities depend on whether it is a 
heartbeat monitor, anchor node, or both. Each case is sum- 
marized below. 

Heartbeat monitor and anchor node 

On every connectivity change, i.e., at every deadline for 
receiving member pulses, this node will recompute whether 
quorum exists based solely on its stored connectivity bit- 
map. Resultant quorum state is included in broadcasted 
monitor pulses. Received quorum state information from 
other anchor nodes is ignored. 
Heartbeat monitor but not an anchor node 

This node receives member pulses from all slaves. If a 
member pulse is received from an anchor node, this node 
immediately updates the current quorum state and transmits 
the current quorum state on the next monitor pulse. 
Heartbeat slave and anchor node 

Whenever there is a connectivity change, these nodes 
recompute quorum based solely on the connectivity infor- 
mation received from the heartbeat monitor. These nodes 
include the resultant quorum state in their member pulses. 
Quorum information received from the heartbeat monitor is 
ignored. 

Heartbeat slave and not an anchor node 

These nodes transmit no quorum information. These 
nodes receive updated quorum information from the heart- 
beat monitor's monitor pulses. 

For embodiments where nodes are grouped, each group 
can elect a group heartbeat transmitter. The group heartbeat 
transmitter notifies a group level heartbeat monitor that the 
group it represents is active. The group level heartbeat 
monitor periodically broadcasts the status of the set of
groups present in the network. This hierarchical grouping 
can be of arbitrary depth. 

In some embodiments, quorum information is discarded
after a certain period of time. This may be accomplished by
associating a timer with quorum information (on heartbeat
monitors and slaves) that is restarted whenever the node
receives quorum information. Thus, when the last anchor
node ceases functioning (and therefore ceases transmission
of quorum information), non-anchors will know they no
longer have quorum no later than the time-out period of the
timer.
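
The expiring quorum information can be sketched as follows. The class name `QuorumView`, its methods, and the explicit `now` timestamps (used instead of a real clock so the behavior is easy to follow) are assumptions for illustration.

```python
class QuorumView:
    """A node's cached view of quorum, discarded after a time-out."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.quorum_flag = False
        self.last_update = None  # time at which quorum information last arrived

    def on_quorum_info(self, has_quorum, now):
        """Record received quorum information and restart the timer."""
        self.quorum_flag = has_quorum
        self.last_update = now

    def has_quorum(self, now):
        """Stale or missing information is treated as loss of quorum."""
        if self.last_update is None or now - self.last_update > self.timeout:
            return False
        return self.quorum_flag
```

Because the timer restarts on every update, a node only concludes that quorum is lost after a full time-out period has passed without any quorum information arriving.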

As noted above, anchor nodes maintain a record of the 
current node database, i.e., anchor nodes record the current 
connectivity state of the network. Anchor nodes may store
the node database to disk storage, or some other persistent
storage mechanism, in order to provide backup during node 
failures. The node database may be written to a specific 
directory location. Updates to the database may be controlled
by a centralized database. When an anchor node
prepares to update the database, it may indicate the operation
it is attempting to perform (add, delete, or change a node), 
data identifying the node for which an entry is changing, and 
the version number of the database that will be used if the 
update is successful. 

In networks having more than one anchor node, anchor
nodes must enter into an arbitration algorithm to perform a
node database update. An anchor node that initiates the node
database update assumes the role of "coordinator." The
coordinator obtains a list of all the anchor nodes currently in
the quorum set and each anchor node in this list assumes the
role of a "subordinate" anchor node for the purposes of the
initiated update. While the database update is in progress,
the coordinator and subordinates do not allow a second node
database update to begin.

In networks having a single anchor node, the coordinator
will obtain a list of anchor nodes that includes only the
coordinator. In the event that the coordinator retrieves a list
of anchor nodes that is empty, an error has occurred and the
update is immediately terminated. In some embodiments the
coordinator's first action is to check for quorum. If no
quorum exists, then the update is immediately terminated.
RAM directory recovery 

As described above, global RAM directory (GRD) pages 
are volatile pages that are not backed up to redundant,
reliable, persistent disk storage and are frequently modified.
These qualities make GRD pages highly vulnerable to node 
failure. Because GRD pages enable the location of other 
GRD pages, losing a GRD page can result in a section of the 
shared memory space becoming unfindable. 

In brief overview, when a node in the network fails, all
other nodes cease processing. The GRD is discarded,
synchronously repopulated with the contents of the surviving
nodes' local RAM cache, and processing resumes.

The notions of "node state" and "network state" should be
introduced. A node has four states: normal, notified,
quiescent, and rebuilding. In the normal state, a node is
functioning normally. When notified of another node's
failure, the node enters the "notified" state and waits for all
its local processing to cease. Once all processing has ceased,
the node enters the quiescent state and discards all GRD
pages it has cached. When the node receives a "start
repopulating" message from the recovery coordinator, it leaves the
quiescent state and enters the rebuilding state. When the
node reports to the recovery coordinator that it has
completed rebuilding, the node re-enters the normal state.
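
The four node states and the transitions between them can be sketched as a small state machine. The enum, the event names, and the `step` function are hypothetical names chosen for this illustration; only the states and their ordering come from the description above.

```python
from enum import Enum


class NodeState(Enum):
    NORMAL = "normal"
    NOTIFIED = "notified"
    QUIESCENT = "quiescent"
    REBUILDING = "rebuilding"


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (NodeState.NORMAL, "failure_notified"): NodeState.NOTIFIED,
    (NodeState.NOTIFIED, "processing_ceased"): NodeState.QUIESCENT,
    (NodeState.QUIESCENT, "start_repopulating"): NodeState.REBUILDING,
    (NodeState.REBUILDING, "rebuild_reported"): NodeState.NORMAL,
}


def step(state, event):
    """Return the next node state, rejecting events that are invalid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.value}") from None
```

Rejecting out-of-order events is a design choice of the sketch; it makes it visible if, for example, a "start repopulating" message reaches a node that has not yet quiesced.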

Network state has similar properties, except that the 
network is in the notification state if any node is in the 
notification state, the network is in the quiescent state only
when all nodes are in the quiescent state, the network is in 
the rebuilding state if any nodes are in the rebuilding state, 
and the network is in the normal state when the first node 
returns to the normal state. 
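
One way to derive the network state from the individual node states is sketched below. The rules in the text overlap (for example, some nodes are still normal while others are already notified), so the order in which the rules are checked here is an assumption, as is the function name `network_state`. The sketch uses the label "notified" for what the text calls the notification state.

```python
def network_state(node_states):
    """Derive the network state from an iterable of node state names."""
    states = list(node_states)
    if not states:
        raise ValueError("at least one node state is required")
    # Any notified node puts the whole network in the notified state.
    if any(s == "notified" for s in states):
        return "notified"
    # The network is quiescent only when every node is quiescent.
    if all(s == "quiescent" for s in states):
        return "quiescent"
    # Past notification, the first node back in normal makes the network normal.
    if any(s == "normal" for s in states):
        return "normal"
    # Otherwise at least one node is rebuilding.
    return "rebuilding"
```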

When an anchor node notices that a node has failed (via 
the heartbeat mechanism) or receives a request to rebuild
the GRD from another node that had detected a node failure, 
it enters the "notified" state and negotiates with the other 
anchors to become the recovery coordinator, and thereby 
gain control of the rebuild. The negotiation to control the 
rebuild can rely on many different qualities. For example, 
anchor nodes may negotiate based on identification code, 
with lower assigned identification codes "winning" the 
negotiation. If the anchor node loses the negotiation, it 
defers to the winner, ceases attempting to control the rebuild, 
waits for a "start recovery" message, and proceeds as 
described above. 
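
The example negotiation rule, in which the lowest identification code wins, can be sketched in a few lines. The function name and the use of integer identification codes are assumptions; the text leaves the negotiation criteria open.

```python
def elect_recovery_coordinator(candidate_ids):
    """Pick the recovery coordinator: the anchor with the lowest identification code."""
    candidates = list(candidate_ids)
    if not candidates:
        raise ValueError("no anchor node is available to coordinate recovery")
    return min(candidates)


def wins_negotiation(my_id, competing_ids):
    """True if this anchor should control the rebuild, False if it should defer."""
    return elect_recovery_coordinator([my_id, *competing_ids]) == my_id
```

Because every anchor applies the same deterministic rule to the same set of identification codes, all anchors reach the same result without a further round of messages.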

Otherwise, if the anchor node controls the rebuild, it sends
a "quiesce for recovery" message to all nodes and waits to
receive all the replies. This can be a synchronous process,
although it may be desirable for it to be asynchronous to
accommodate nodes of varying response speeds and
capabilities.

A non-anchor node will first receive the "quiesce for 
recovery" message from the recovery coordinator anchor 
node. This will cause the node to enter the "notified" state. 
Once in the notified state, all local processing activity is 
stopped, and errors should be returned for most received 
remote invocations. This state must either complete all 
invalidations or reliably fail them. Otherwise, a page could 
be modified while disconnected copyholders are outstand- 
ing. Once all local processing has terminated, a reply to the 
"quiesce for recovery" message is sent to the coordinator 
and the node enters the quiescent state. 

During the quiescent state the node removes all GRD 
pages from its local RAM cache, whether dirty or not, 
discards copyset information, and waits for a "start repopu- 
lating" message. 

Once all replies have been received by the recovery 
coordinator, it sends a "start repopulating" message to all 
nodes and waits for their reply that repopulation is complete. 

When the node receives a "start repopulating" message, it 
enters the rebuilding state and sends a reply to the message. 
For any page remaining in the node's local RAM cache, the 
node sends a message to the recovery coordinator identify- 
ing the page by global address and requesting to become the 
responsible node and owner of the page. If successful, the
node owns the page. If not successful, the node should drop 
the page from its local RAM cache or register with the new 
owner to become a copychild of the page. 
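
The repopulation step can be sketched as a claim protocol. This is a simplified, single-process illustration: `RecoveryCoordinator`, `claim`, and `repopulate` are hypothetical names, the coordinator is assumed to grant ownership to the first node that claims a page, and a node whose claim fails is assumed to register as a copy holder rather than drop the page.

```python
class RecoveryCoordinator:
    """Grants page ownership during GRD repopulation (first claim wins, by assumption)."""

    def __init__(self):
        self.owner = {}  # global address -> owning node

    def claim(self, node, address):
        if address not in self.owner:
            self.owner[address] = node
            return True
        return False


def repopulate(node, cached_addresses, coordinator, copysets):
    """Claim every page in the node's local RAM cache.

    Returns the addresses this node now owns. For pages owned by another node,
    the node is recorded in `copysets` as a copy holder.
    """
    owned = []
    for address in cached_addresses:
        if coordinator.claim(node, address):
            owned.append(address)
        else:
            copysets.setdefault(address, set()).add(node)
    return owned
```

For example, if two nodes both cache page 0x2000, the first to run `repopulate` becomes its owner and the second is recorded as a copy holder of that page.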

Once the node has attempted to become owner of every 
page existing in its local RAM cache, it sends a "repopula- 
tion complete" message to the recovery coordinator and 
waits to receive a "resume operations" message. Once every 
node has sent a "repopulation complete" message to the 
recovery coordinator, the coordinator sends a "resume operations"
message to all nodes in the network.

If a node fails during the GRD rebuild process, it can 
either be ignored, or the rebuild process can be restarted. 
Disk directory recovery 

Pages of the GDD are stored by multiple nodes in the 
network to provide some degree of tolerance for node 
failure. When a node fails, persistent data and directory 
pages stored by the failed node need to be reduplexed, in 
order to survive subsequent failures. There are two mecha- 
nisms used to perform this reduplexing function. The first is 
the normal page activation process. Whenever a page is
activated, the primary core holder checks to ensure that the
number of core holders for this page has not fallen below a
minimum threshold. If it has, the primary core holder is
responsible for finding a new core holder. This mechanism
is demand driven, i.e., this mechanism results in pages being 
reduplexed when they are explicitly accessed. The second 
mechanism uses a background reduplexer that asynchro- 
nously schedules the activity. 
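
The demand-driven check performed at page activation can be sketched as follows. The threshold value, the function name `activate_page`, and the policy of choosing replacement holders in sorted order are assumptions made for the example; the text does not say how a new core holder is chosen.

```python
MIN_CORE_HOLDERS = 2  # hypothetical minimum number of core holders per page


def activate_page(core_holders, live_nodes, min_holders=MIN_CORE_HOLDERS):
    """Return the core holder list for a page after the activation-time check.

    Holders on failed nodes are dropped, then new holders are recruited from
    the remaining live nodes until the minimum is met or no candidates remain.
    """
    holders = [node for node in core_holders if node in live_nodes]
    candidates = [node for node in sorted(live_nodes) if node not in holders]
    while len(holders) < min_holders and candidates:
        holders.append(candidates.pop(0))
    return holders
```

A background reduplexer could apply the same check to pages reported as under-replicated instead of waiting for them to be accessed.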

One of the anchor nodes present in a network is
designated as the primary anchor node (PAN). The PAN maintains
the primary copy of the global disk directory (GDD) root
page. The PAN is assigned dynamically. Anchor nodes may
arbitrate to become the PAN in the event of a PAN failure,
or a configuration file may be provided which lists a series
of nodes that may serve as the PAN. In either case, quorum 
must still exist. Anchor nodes that are not the PAN behave 
in the same manner as non-anchor nodes with respect to the 
reduplexing process. 

In order to provide asynchronous GDD recovery, the PAN
maintains and controls the background reduplexing process.
The PAN receives notification from other nodes when they
detect GDD pages that are below the minimum core holder
threshold. These GDD pages are typically detected during
normal directory traversal operations. The PAN reduplexes
by activating pages, as described above.

Nodes maintain state regarding the GDD pages they have
encountered that have fewer than the threshold number of core
holders. When such a page is encountered, the node
notifies the PAN to reduplex the page, and then monitors the
PAN. Should the PAN fail during reduplexing, the node
waits for the other anchor nodes to select a new PAN (or
until a new PAN is assigned from a configuration file) and
transmits the reduplex request to the new PAN. The communication
subsystem is utilized to monitor the node states
for this process. The node state is also used to reinitiate
reduplexing operations that were incomplete due to resource
constraints, such as available disk space.
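The per-node bookkeeping described above might look like the following sketch. The `ReduplexTracker` class, its method names, and the `send` callback standing in for the communication subsystem are hypothetical; the text does not specify an interface.

```python
class ReduplexTracker:
    """Per-node record of GDD pages seen below the core holder threshold."""

    def __init__(self, send):
        self.send = send      # callable(pan, page_id); hypothetical transport hook
        self.pending = {}     # page_id -> PAN the request was last sent to

    def report(self, page_id, pan):
        # Remember the request so it can be re-sent if the PAN changes.
        self.pending[page_id] = pan
        self.send(pan, page_id)

    def on_pan_changed(self, new_pan):
        # Re-send every request that was outstanding at a different (failed) PAN.
        for page_id, pan in list(self.pending.items()):
            if pan != new_pan:
                self.report(page_id, new_pan)

    def on_reduplexed(self, page_id):
        # The PAN confirmed the page is back above the threshold.
        self.pending.pop(page_id, None)


sent = []
tracker = ReduplexTracker(lambda pan, page_id: sent.append((pan, page_id)))
tracker.report(42, "anchor-1")
tracker.on_pan_changed("anchor-2")  # anchor-1 failed; anchor-2 took over
```

Keeping the request in `pending` until it is confirmed is what allows the same state to drive retries after a resource shortage as well as after a PAN failure.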

Various forms of network outages can cause sets of nodes
to become partitioned into separate aggregations. When this
occurs, duplexed copies of pages may be split between
aggregations, i.e., one core holder for a page is present in the
network while another is contained in the aggregation. The
quorum set, which is a majority of the original set of anchor
nodes, is required for write access to the data and directory
pages. An aggregation of nodes that is not in the quorum
set, therefore, may serve data pages but cannot write to
them. This inability to modify pages without quorum is
enforced by having the anchor nodes disallow modifications
to the GDD root page when quorum is lost. If a node
attempts to modify a page without access to all of its core
holders, it first will attempt to modify the set of core copy
holders, which requires the page's GDD page to be modified.
This write to the GDD page will itself require a write to
the GDD page that stores its core copyset, and so on, until finally
the node will attempt to modify the GDD root page. Since
this operation will be failed by the anchor nodes, the original
write to a normal data page will fail. The anchor nodes'
enforcement of quorum on the GDD root page prevents
data corruption in the presence of a partitioned network.
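The quorum rule and the chain of writes that ends at the GDD root page can be illustrated with a short sketch. The function names, the `parent_of` and `holders_reachable` dictionaries, and the literal page name "root" are assumptions made for illustration. The sketch also assumes that a write proceeds without a copyset change whenever all core holders of the page are reachable, which is one reading of the passage rather than something it states outright.

```python
def has_quorum(reachable_anchors, original_anchors):
    """True when a strict majority of the original anchor set is reachable."""
    anchors = set(original_anchors)
    live = set(reachable_anchors) & anchors
    return len(live) > len(anchors) / 2


def can_write(page, parent_of, holders_reachable, quorum):
    """Walk from a page toward the GDD root; the page named 'root' has no parent."""
    while page != "root":
        if holders_reachable[page]:
            return True         # all core holders present, no copyset change needed
        page = parent_of[page]  # changing the copyset means writing the parent GDD page
    return quorum               # anchors reject root modifications without quorum


anchors = {"a1", "a2", "a3"}
parent_of = {"data": "gdd1", "gdd1": "root"}
split = {"data": False, "gdd1": False}  # each page lost a core holder to the partition
minority_ok = can_write("data", parent_of, split, has_quorum({"a3"}, anchors))
majority_ok = can_write("data", parent_of, split, has_quorum({"a1", "a2"}, anchors))
```

Only the root page is guarded directly; every other page inherits the protection because a copyset change eventually requires a write to the root.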

File system recovery

Another result of node failure is the generation of file
system metadata inconsistencies. File system metadata
inconsistencies take a number of forms, including: (1) inconsistency
between the file system representation of allocated
pages and the global disk directory (GDD) of allocated or
deallocated pages resulting from failure during allocation or
deallocation of pages; (2) erroneous file attributes contained
in a file Inode; (3) inconsistent pages from the file system
resulting from failure during a transaction that requires
multiple flush operations to record updates, such as updates
that span multiple disk blocks (directory updates and fileset
operations); (4) Inode directory errors resulting from failures
of single page and multiple page updates; and (5) inconsistency
between file attributes as stored in a file's Inode and as stored
in the file's directory entry, resulting from a failure during the
synchronization process.

File system transactions involving GDD updates are
transactions that include allocation, deallocation, and
unallocation (i.e., freeing disk space while keeping address
space reserved) operations. These transactions require that
the state of the pages being allocated be updated in the file
system metadata structures and also require the invocation
of GDD allocation, deallocation, or unallocation functions.
In order to perform these functions in a manner that allows the
file system to recover from a node failure, file system pages
must be associated with a recovery handle, which is transmitted
to the GDD when the page is allocated. The recovery
handle is any identification entity that uniquely identifies a
page, such as the file system object identification code.

The file system provides a callback routine that can be
invoked by the GDD to determine the current state of
allocation of specified pages according to the file system.
For allocation of pages, the file system must invoke the
GDD allocation function before attempting the transaction.
After the GDD allocation successfully completes, the file
system may permanently record the allocation state. If the
transaction fails to complete, the GDD may have allocated
pages that the file system will not recognize.

For deallocation or unallocation of pages, the file system
must permanently record the deallocation or unallocation
before invoking the GDD function to deallocate or unallocate
the pages. If the transaction fails to complete, the GDD
may end up with file system pages that the file system does
not recognize.

The GDD may call the file system to verify pages associated
with the file system object ID. The GDD passes to the
file system what it perceives the state of allocation to be. The
file system searches its metadata structures to identify the
pages and reports back to the GDD only if it disagrees with
the GDD.
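The verification callback can be sketched as a function that compares the GDD's view against the file system metadata and returns only the differences. The function name `verify_pages` and the representation of each view as a set or dictionary are assumptions for illustration.

```python
def verify_pages(fs_allocated, gdd_view):
    """File system side of the callback: return only pages whose state differs.

    fs_allocated: set of page ids the file system metadata records as allocated.
    gdd_view:     dict page_id -> bool, the GDD's belief (True = allocated).
    """
    return {
        page_id: (page_id in fs_allocated)
        for page_id, gdd_says_allocated in gdd_view.items()
        if gdd_says_allocated != (page_id in fs_allocated)
    }


# Page 3 was allocated in the GDD, but the node failed before the file system
# recorded it, so the file system reports it as not allocated.
disagreements = verify_pages(fs_allocated={1, 2}, gdd_view={1: True, 2: True, 3: True})
```

An empty result means the two views agree, which matches the rule that the file system reports back only on disagreement.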

File Inode updates require updating metadata information
contained in the file's Inode. These attributes may include
file time stamps, file flags, and the end of file mark.
Attributes may be updated directly by various set file
attributes operations or, as in the case of time stamps, they
may be set indirectly as part of other file system operations.
Since the Inode does not span disk blocks, i.e., the Inode
occupies a single page, the update either succeeds completely
or fails completely. Accordingly, the file system will
not detect inconsistency in the file system metadata. This is
true for file size updates also because, even though file size
of a primary stream is stored in both the file Inode and the
data stream descriptor, both of these metadata structures
reside on the same page in the same disk block.

The file system metadata updates that span multiple pages
include directory entry updates (such as create file or
directory, delete file or directory, and rename file or
directory), fileset updates (such as creation and deletion), and
super root updates or creations. Each of these multiple page
metadata updates is vulnerable to node failure during the
transaction and each must be handled in a specific manner.

Directory entry insertion and deletion can require multiple
flushes involving multiple pages. Depending on the distribution
of the directory entries, the addition or deletion of a
file may affect a page containing the entry, the pages
containing the previous and the next entries in the list, and
the Inode page containing the name hash table. In the case
of directory entry insertion, if no free slot can be found in
existing directory pages, then a new entry page will be
allocated.

A transaction to insert or delete a directory entry must first 
mark the directory Inode with the type of transaction. Marks 
may include: create; create_directory; rename; rename tar- 
get; and delete. The Inode number of the file being inserted 
or removed is also stored in the directory Inode. The 
directory Inode is then flushed, which marks the beginning 
of the update transaction. 

Once the directory Inode is flushed, the directory is 
updated with the entry modifications, along with the pages 
containing the previous and next entries in the sort and hash
list and the hash table Inode page. These may occur in
parallel. The transaction mark in the directory Inode is then 
cleared and the Inode written out. This marks the end of the 
transaction. A directory Inode page that has a transaction 
mark indicates that the directory was the target of an 
incomplete transaction and the directory needs to be recov- 
ered. The file system recovers directory Inode pages by 
invoking a function that verifies and repairs individual 
directories. The function should check and repair any
inconsistencies found in the directory, such as broken sort/ 
hash linked lists, inconsistent free block linked lists, or 
incorrect hash table bucket entries. If the function fails to 
repair the directory, the directory should be marked as 
corrupt and future access to the directory should be denied. 
Otherwise, the file entry Inode number is extracted from the 
transaction mark and connectivity to the Inode is verified. 
Once verified, the transaction mark in the directory Inode is 
cleared and the directory Inode page is flushed. This marks 
the end of recovery. 
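The mark, update, clear sequence and its recovery path can be sketched as below. The `DirectoryInode` class, the in-memory `entries` dictionary, the flush counter, and the `crash_before_clear` flag are simplifications introduced here. The text does not say what happens when connectivity cannot be verified, so the sketch leaves the mark set in that case.

```python
class DirectoryInode:
    def __init__(self):
        self.mark = None    # (transaction type, target Inode number) or None
        self.entries = {}   # name -> Inode number
        self.flushes = 0

    def flush(self):
        self.flushes += 1   # stand-in for writing the Inode page to disk


def insert_entry(dir_inode, name, inode_no, crash_before_clear=False):
    dir_inode.mark = ("create", inode_no)
    dir_inode.flush()                    # beginning of the transaction
    dir_inode.entries[name] = inode_no   # entry and neighbor page updates
    if crash_before_clear:
        return                           # simulated node failure
    dir_inode.mark = None
    dir_inode.flush()                    # end of the transaction


def recover(dir_inode, verify_and_repair):
    if dir_inode.mark is None:
        return "clean"
    if not verify_and_repair(dir_inode):
        return "corrupt"                 # caller should deny further access
    _, inode_no = dir_inode.mark
    if inode_no not in dir_inode.entries.values():
        return "unverified"              # not covered by the text; mark stays set
    dir_inode.mark = None
    dir_inode.flush()                    # end of recovery
    return "recovered"


d = DirectoryInode()
insert_entry(d, "a.txt", 12, crash_before_clear=True)
state = recover(d, verify_and_repair=lambda inode: True)
```

Because the mark is flushed before any entry page changes, a surviving mark is sufficient evidence that the directory needs to be checked.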

File creation and directory creation involve allocation of
new Inode pages and, for directories, allocation of the first 
directory entry page, and the subsequent insertion of the new 
file or directory into the parent directory. The file system 
implements file or directory creation by first calling the 
Inode directory function to return a new Inode. The trans- 
action is marked in the parent directory Inode and flushed. 
All data structure updates are performed under this transac- 
tion mark. Once these updates are flushed, the transaction 
mark in the directory Inode is cleared. 

A transaction mark found in the parent Inode (i.e., the 
Inode of the parent directory) indicates that a node failure 
occurred during a transaction. In order to recover from the 
failed file or directory creation transaction, the file or direc- 
tory Inode is located using the transaction mark contained in 
the parent directory Inode. The parent directory is recovered 
in a manner similar to that described above, but the parent Inode
transaction mark is not immediately cleared. If the parent 
directory can be repaired or is consistent and the new file 
Inode or the new directory Inode is apparent from the 
transaction mark, then the transaction mark is cleared in the 
parent Inode and flushed. In this case, the transaction is 
complete. Otherwise, the file system must undo the trans- 
action. 

To undo the transaction, the file system must determine if 
the transaction is a create directory or create file transaction. 
If the failed transaction was for the creation of a directory 
and the file system has the pointer to the directory entry page 
in the Inode page, then it may simply deallocate the directory
entry page. Otherwise, the file system must call the Inode
directory function to free the Inode, remove the entry from
the parent directory if an entry has been created, and then
clear the transaction mark from the parent Inode and flush it.

File deletion and directory deletion are performed in two 
phases. In brief overview, the file or directory is first marked 
for deletion and its entries removed from the parent direc- 
tory. Then, when all open handles to the file or the directory 
have been closed, it is physically deleted. 

To recover from a failed file deletion or directory deletion 
transaction, a file system should, during the mark for delete 
phase, set a transaction mark in the file Inode or the directory 
Inode to indicate that it is being deleted. The Inode is then 
flushed. The file entry or directory entry is then removed 
from the parent directory using the general method 
described above. The transaction mark set in the parent 
Inode indicates that a deletion transaction is occurring. Since 
directory look up for the file will now fail, no new file 
handles can be opened. However, existing handles may still 
be used to access the file or directory. 

Once all the existing handles to the file or directory have 
been closed, and if the target of the transaction is a directory, 
then all directory entry pages are deallocated. The Inode is 
lazily returned to a local lookaside list. Inodes are allocated
and deallocated from the global pool in groups, for performance
and scaling reasons. However, Inodes may be allocated
and deallocated singly. If a parent directory Inode or
a file/directory Inode is found with a transaction mark 
indicating an incomplete delete transaction, then the deletion 
must be recovered. 

However, if the parent Inode has a delete transaction 
mark, then the Inode of the file/directory being deleted must 
be located using the transaction mark in the parent Inode. If 
the file/directory Inode indicates that the target of the delete 
is a directory, all directory entry pages must be deallocated. 
The Inode is returned to the local pool, for later disposition. 
Once the removal of the file/directory's entry has been 
verified, the parent Inode's transaction mark may be cleared 
and the parent Inode can be flushed. This step completes the 
recovery of the deletion. 

A rename operation is effected by performing an insert to 
a new directory and a delete from an old directory. However, 
the insertion transaction is not cleared until both the inser- 
tion and deletion successfully complete. This allows the 
system to either "back out" of an unsuccessful rename 
operation or to complete the failed rename operation when 
it is encountered, depending on the amount of progress made 
before failure. 

Fileset creation and deletion also require multiple page
flushes which are vulnerable to a node failure during the 
transaction. Fileset creation involves the allocation and 
initialization of the new fileset page, the fileset's Inode 
directory, and the root directory. To recover a failed fileset 
creation, the file system must begin by allocating the nec- 
essary pages to initialize the fileset, the Inode directory, and 
the root directory. Root directory pages are allocated with 
the file system object ID pointing to the root directory and 
fileset related pages are initialized with the fileset ID only. 
Should a node fail at this point, the allocated pages will be 
lost. The super root is then updated to record the new fileset. 
If this step is successful, the fileset creation transaction is 
successful. 

A fileset deletion operation will deallocate the fileset 
page, the Inode directory, and the root directory. This 
transaction begins by marking the fileset as the target of a 
delete transaction. The fileset page is flushed and the root 
directory is deleted. Once the root directory is deleted, all 
free Inodes are deallocated and this step may be repeated as 
many times as necessary. The super root is then updated to
remove the fileset from the super root's set of filesets.
Should a node fail at this time, the fileset page will be lost
but can be recovered via the GDD callback mechanism. The
fileset deletion is completed by deleting the fileset pages.

The Inode directory is a set of persistent data structures used
to track all the Inodes in a fileset. Using the Inode directory,
the file system can locate any Inode in the fileset. The Inode
directory contains two major components: (1) the free Inode
list and (2) the Inode bit map.

As noted above, file attributes are duplicated in the file 
Inode and its directory entry in order to improve the per- 
formance of directory queries, one of the most important 
performance measures in file systems. Since any file
attribute change must be propagated to both structures, the
number of flushes required to update a file or directory may
double. Since updates to the Inode page and the directory
entry page are separate, inconsistencies may arise between
the two. Inconsistencies may be reduced by providing the
Inode and the directory entry with a synchronization version
number. The two are synchronized if they have the same
synchronization version number. Whenever the Inode is
updated, its synchronization version number is incremented.
Then, the directory entry page is locked and the file
attributes in the Inode are copied to the directory entry,
including the synchronization version number. The entry
page is not yet flushed, but the Inode page is flushed
according to the strategies described above. If a node fails
at this point, the directory entry and the Inode will not be in
synchronization with each other. Once the Inode is flushed,
the entry page is also flushed independently.

During a file open or directory open, the Inode synchronization
version is compared with the directory entry's
synchronization version. If they do not match, the directory
entry is not synchronized with the Inode and the
directory entry must be updated.
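The version comparison performed at open time can be sketched as follows. The `Inode` and `DirEntry` classes model only the flushed, on-disk state, and the `crash_before_entry_flush` flag simulates a node failure between the two flushes; all names are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    attrs: dict = field(default_factory=dict)
    sync_version: int = 0


@dataclass
class DirEntry:
    attrs: dict = field(default_factory=dict)
    sync_version: int = 0


def update_inode(inode, entry, changes, crash_before_entry_flush=False):
    inode.attrs.update(changes)
    inode.sync_version += 1
    if crash_before_entry_flush:
        return  # Inode reached disk, the directory entry copy did not
    entry.attrs = dict(inode.attrs)
    entry.sync_version = inode.sync_version


def open_file(inode, entry):
    """On open, repair a stale directory entry from the Inode."""
    if entry.sync_version != inode.sync_version:
        entry.attrs = dict(inode.attrs)
        entry.sync_version = inode.sync_version
        return True   # entry was resynchronized
    return False


inode, entry = Inode(), DirEntry()
update_inode(inode, entry, {"size": 10})
update_inode(inode, entry, {"size": 20}, crash_before_entry_flush=True)
repaired = open_file(inode, entry)
```

The Inode is treated as authoritative because it is flushed first, so a mismatch always means the directory entry is the stale copy.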

Variations, modifications, and other implementations of 
what is described herein will occur to those of ordinary skill 
in the art without departing from the spirit and the scope of 
the invention as claimed. Accordingly, the invention is to be 
defined not by the preceding illustrative description but 
instead by the spirit and scope of the following claims. 

What is claimed is: 

1. In a system for providing distributed control over data, 
a method for continuing operation after a node failure, the 
method comprising: 

(a) providing a plurality of nodes inter-connected by a 
network which periodically exchange connectivity 
information; 

(b) storing on each node an instance of a data control 
program for manipulating data to provide multiple, 
distributed instances of the data control program; 

(c) interfacing each instance of the data control program 
to a distributed shared memory system that provides 
distributed storage across the inter-connected node and
that provides addressable persistent storage of data; 

(d) operating each instance of the data control program to 
employ the shared memory system as a memory device 
having data contained therein, whereby the shared 
memory system maintains multiple, persistent copies of 
data distributed among more than one network node; 

(e) determining from the exchanged connectivity infor- 
mation the failure of a node; 

(f) determining a portion of the data for which the failed 
node was responsible; and 

(g) storing a copy of the portion of the data for which the
failed node was responsible in persistent storage hosted
by a surviving node.

2. The method of claim 1 wherein step (a) further comprises:

(a-a) providing a plurality of nodes interconnected by a 
network; 

(a-b) designating a heartbeat monitor node from the 
plurality of interconnected nodes; 

wherein the plurality of nodes periodically transmit infor- 
mation to the heartbeat monitor node and the heartbeat 
monitor node periodically broadcasts information to 
the plurality of the nodes. 

3. The method of claim 1 wherein step (a) further com- 
prises: 

(a-a) providing a plurality of nodes interconnected by a 
network, the nodes organized into hierarchical groups 
of nodes; 

(a-b) designating a heartbeat monitor node for each group 
of nodes, 

wherein each node in a group periodically transmits 
information to the heartbeat monitor node for the 
group, each heartbeat monitor node periodically broad- 
casts information to the nodes belonging to its group, 
and the plurality of heartbeat monitor nodes exchange 
information. 

4. The method of claim 2 wherein step (e) comprises
determining the failure of a node from the absence of 
exchanged information from the node. 

5. The method of claim 1 wherein step (d) comprises 
operating each instance of the data control program to 
employ the shared memory system as a memory device 
having data contained therein, whereby the shared memory 
system uses a directory to coordinate access to data stored in 
volatile memory elements associated with each node and 
maintains multiple, persistent copies of data distributed 
among more than one network node. 

6. The method of claim 5 wherein step (f) further com- 
prises: 

(f-a) determining a portion of the data which the failed 
node had stored in its associated volatile memory 
element; and 

(f-b) determining a portion of the data which the failed 
node had stored in its associated persistent memory 
element. 

7. The method of claim 5 wherein step (g) comprises 
rebuilding the volatile memory directory. 

8. In a system for providing distributed control over data, 
a method for continuing operation after a node failure, the 
method comprising: 

(a) providing a plurality of nodes inter-connected by a 
network which periodically exchange connectivity 
information; 

(b) storing on each node an instance of a data control
program for manipulating data to provide multiple, 
distributed instances of the data control program; 

(c) interfacing each instance of the data control program
to a globally addressable data store that provides distributed
storage across the inter-connected node and
that provides addressable persistent storage of data;

(d) operating each instance of the data control program to
employ the globally addressable data store as a memory 
device having data contained therein, whereby the 
globally addressable data store maintains multiple, 
persistent copies of data distributed among more than 
one network node; 

(e) determining from the exchanged connectivity infor- 
mation the failure of a node; 

(f) determining a portion of the data for which the failed 
node was responsible; and 

(g) storing a copy of the portion of the data for which the 
failed node was responsible in persistent storage hosted 
by a surviving node. 

9. The method of claim 8 wherein step (a) further comprises:

(a-a) providing a plurality of nodes interconnected by a 
network; 

(a-b) designating a heartbeat monitor node from the 
plurality of interconnected nodes; 

wherein the plurality of nodes periodically transmit infor- 
mation to the heartbeat monitor node and the heartbeat 
monitor node periodically broadcasts information to 
the plurality of the nodes. 

10. The method of claim 8 wherein step (a) further 
comprises: 

(a-a) providing a plurality of nodes interconnected by a 
network, the nodes organized into hierarchical groups 
of nodes; 

(a-b) designating a heartbeat monitor node for each group 
of nodes, wherein each node in a group periodically 
transmits information to the heartbeat monitor node for 
the group, each heartbeat monitor node periodically 
broadcasts information to the nodes belonging to its 
group, and the plurality of heartbeat monitor nodes 
exchange information. 

11. The method of claim 9 wherein step (e) comprises 
determining the failure of a node from the absence of 
exchanged information from the node. 

12. The method of claim 8 wherein step (d) comprises 
operating each instance of the data control program to 
employ the globally addressable data store as a memory 
device having data contained therein, whereby the globally 
addressable data store uses a directory to coordinate access 
to data stored in volatile memory elements associated with 
each node and maintains multiple, persistent copies of data 
distributed among more than one network node. 

13. The method of claim 12 wherein step (f) further 
comprises: 

(f-a) determining a portion of the data which the failed 
node had stored in its associated volatile memory 
element; and 

(f-b) determining a portion of the data which the failed 
node had stored in its associated persistent memory 
element. 

14. The method of claim 12 wherein step (g) comprises
rebuilding the volatile memory directory.